[00:01:04] RECOVERY - puppet last run on db2051 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [00:02:00] (03CR) 10Tim Landscheidt: "Why did it work on toolsbeta-puppetmaster3, though? They share the same validate_re() unless the cluster puppet master hasn't been restar" [puppet] - 10https://gerrit.wikimedia.org/r/226652 (owner: 10Yuvipanda) [00:02:33] RECOVERY - puppet last run on cerium is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [00:03:16] (03CR) 10Yuvipanda: "Not sure, and the cluster puppetmaster has definitely been restarted" [puppet] - 10https://gerrit.wikimedia.org/r/226652 (owner: 10Yuvipanda) [00:04:33] RECOVERY - puppet last run on xenon is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:08:24] RECOVERY - puppet last run on restbase1004 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [00:08:49] ori: fyi, I have no deployment key yet, so I should put this off another few days. [00:09:14] awight: 'sokay. That's probably wise. [00:12:04] ori: You've lit the way, though! Soon this will be a story we tell around the evil-smelling fire [00:14:47] 7Puppet, 6operations: Test that/which self-hosted puppet masters resemble production - https://phabricator.wikimedia.org/T106768#1477537 (10scfc) 3NEW [00:16:14] (03PS1) 10Ori.livneh: Add ProxyPass rule for thumb_handler.php [puppet] - 10https://gerrit.wikimedia.org/r/226658 (https://phabricator.wikimedia.org/T84842) [00:17:18] (03CR) 10Ori.livneh: [C: 032] Add ProxyPass rule for thumb_handler.php [puppet] - 10https://gerrit.wikimedia.org/r/226658 (https://phabricator.wikimedia.org/T84842) (owner: 10Ori.livneh) [00:17:35] matanya: I think I used the right key. I did specify one. I don't think I mixed them up, but I'll double check. 
[00:18:09] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1477560 (10TJones) @Matanya - good to know it's live. At least then I know the problem is on my end. Thanks. [00:19:59] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1477567 (10Krenair) What error are you getting when trying to connect? [00:20:31] 7Puppet, 6operations: Test that/which self-hosted puppet masters resemble production - https://phabricator.wikimedia.org/T106768#1477568 (10scfc) (Though I did only test that change by default class parameters and not faking Hiera configuration, so that may be the culprit there … But as the production puppet m... [00:21:20] !log Re-enabled Puppet on mw1153 [00:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:33:18] any luck Trey314159? [00:44:13] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1477735 (10scfc) [00:44:13] 7Puppet, 6operations: Test that/which self-hosted puppet masters resemble production - https://phabricator.wikimedia.org/T106768#1477732 (10scfc) 5Open>3declined a:3scfc # I retested the change with a fake Hiera configuration and it failed, so indeed my method for testing was faulty. # On second thou... [00:44:49] (03CR) 10Ori.livneh: [C: 04-1] "Don't use dashes in resource names; use underscores instead." [puppet] - 10https://gerrit.wikimedia.org/r/226507 (owner: 10Muehlenhoff) [00:45:57] (03PS2) 10Ori.livneh: Add ferm rules for HHVM admin site [puppet] - 10https://gerrit.wikimedia.org/r/226507 (owner: 10Muehlenhoff) [00:47:03] (03CR) 10Ori.livneh: [C: 04-1] "Underscores, not dashes, in resource names! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/226506 (https://phabricator.wikimedia.org/T104972) (owner: 10Muehlenhoff) [00:47:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [00:48:23] (03CR) 10Tim Landscheidt: "JFTR: My method of testing was faulty. With faking a Hiera configuration (verbatim "role::labs::instance::testarg: 16" added to hieradata" [puppet] - 10https://gerrit.wikimedia.org/r/226652 (owner: 10Yuvipanda) [00:49:25] (03CR) 10Tim Landscheidt: [C: 04-1] "After the failure of https://gerrit.wikimedia.org/r/#/c/226652/, I need to retest this with an array passed by Hiera." [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [00:49:56] (03PS2) 10Ori.livneh: statsdlb: Fix strict puppet-lint check [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [00:53:44] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [01:23:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [01:24:49] Krenair: thanks for asking, I haven't tried again this evening. So no luck, good or bad, yet. 
[01:31:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [01:41:46] 6operations, 6Labs: lvm 'others20150715' snapshot full on labstore1001 - https://phabricator.wikimedia.org/T106601#1477876 (10yuvipanda) 5Open>3Resolved a:3yuvipanda The snapshot has been deleted by @Coren [02:02:31] !log LocalisationUpdate failed (1.26wmf15) at 2015-07-24 02:02:31+00:00 [02:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:04:55] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1477907 (10Springle) @jcrespo, no, I did not out-of-band change or use skip counter. I found the machine exactly as you described on IRC, and only did research by dumping logs to ge... [02:06:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 24 02:06:41 UTC 2015 (duration 6m 40s) [02:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:15:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [02:25:20] !log l10nupdate Synchronized php-1.26wmf15/cache/l10n: (no message) (duration: 07m 12s) [02:25:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [02:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:49] !log restarting restbase on restbase1006 [02:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:25] !log LocalisationUpdate completed (1.26wmf15) at 2015-07-24 02:29:25+00:00 [02:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:58:32] (03PS7) 10Yuvipanda: The basic RuboCop configuration [puppet] - 10https://gerrit.wikimedia.org/r/218389 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [02:59:02] (03PS8) 10Yuvipanda: 
rubocop: Add basic configuration [puppet] - 10https://gerrit.wikimedia.org/r/218389 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [02:59:27] (03PS3) 10Ori.livneh: Fix graphite keys in API dashboard [puppet] - 10https://gerrit.wikimedia.org/r/223659 (https://phabricator.wikimedia.org/T85841) (owner: 10Gergő Tisza) [02:59:34] (03CR) 10Ori.livneh: [C: 032] Fix graphite keys in API dashboard [puppet] - 10https://gerrit.wikimedia.org/r/223659 (https://phabricator.wikimedia.org/T85841) (owner: 10Gergő Tisza) [03:00:12] (03PS2) 10Yuvipanda: Fixed Style/TrailingWhitespace RuboCop offense [puppet] - 10https://gerrit.wikimedia.org/r/225238 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [03:00:29] (03PS3) 10Yuvipanda: rubocop: Fixed Style/TrailingWhitespace offense [puppet] - 10https://gerrit.wikimedia.org/r/225238 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [03:00:42] (03CR) 10Yuvipanda: [C: 032] rubocop: Add basic configuration [puppet] - 10https://gerrit.wikimedia.org/r/218389 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [03:00:54] (03CR) 10Yuvipanda: [C: 032] rubocop: Fixed Style/TrailingWhitespace offense [puppet] - 10https://gerrit.wikimedia.org/r/225238 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [03:28:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds [03:44:23] 7Puppet, 6Phabricator, 5Patch-For-Review: Create puppet role for Phabricator hosted repo testing - https://phabricator.wikimedia.org/T104827#1478054 (10yuvipanda) [03:59:30] 6operations, 6Labs: New instances stuck unable to run puppet (and no sshing in!) - https://phabricator.wikimedia.org/T101916#1478074 (10yuvipanda) 5Open>3Resolved a:3yuvipanda New images were built. 
[04:04:54] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 11.11% of data above the critical threshold [100000000.0] [04:05:34] gj YuviPanda [04:09:45] Reedy: yw [04:12:09] ori: !log LocalisationUpdate failed (1.26wmf15) at 2015-07-24 02:02:31+00:00 [04:12:15] have you looked at that yet? [04:19:37] 6operations, 6Labs: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1478148 (10yuvipanda) 5Open>3Invalid a:3yuvipanda Marking as invalid because there's no unpuppetized (or otherwise) bond0 now. [04:19:45] 7Puppet: Puppet Trebuchet provider compares refname with commit sha1 and does NOT refresh the git repo! - https://phabricator.wikimedia.org/T77002#1478151 (10yuvipanda) [04:20:45] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [04:26:34] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [05:11:39] (03PS4) 10Ori.livneh: Fix graphite keys in API dashboard [puppet] - 10https://gerrit.wikimedia.org/r/223659 (https://phabricator.wikimedia.org/T85841) (owner: 10Gergő Tisza) [05:11:47] (03CR) 10Ori.livneh: [V: 032] Fix graphite keys in API dashboard [puppet] - 10https://gerrit.wikimedia.org/r/223659 (https://phabricator.wikimedia.org/T85841) (owner: 10Gergő Tisza) [05:37:59] (03PS3) 10Ori.livneh: WIP: Add etcd configuration client [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 [05:52:27] !log Added rl-test.php on testwiki (mw1017) to gather stats about cache-control rollover (Catrope, Krinkle). Used by testwiki/test2wiki/mediawikiwiki Common.js (sampled). See T105255. 
[05:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:53:17] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 24 05:53:16 UTC 2015 (duration 53m 15s) [05:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:11:01] (03PS2) 10Awight: Try pointing to a /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) [06:19:12] (03PS3) 10Awight: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) [06:20:08] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1478216 (10jcrespo) p:5Triage>3High [06:32:04] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures [06:32:34] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 1 failures [06:32:34] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:32:54] PROBLEM - puppet last run on cp3017 is CRITICAL Puppet has 1 failures [06:33:23] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures [06:33:33] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 1 failures [06:34:23] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:35:28] (03PS4) 10Ori.livneh: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [06:35:34] (03CR) 10Ori.livneh: [C: 031] Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [06:35:49] 6operations, 7Database: mariadb multi-source replication glitch with 
site_identifiers - https://phabricator.wikimedia.org/T106647#1478223 (10jcrespo) Tests I would do: * What happens if we do a CHANGE MASTER to the most last DELETE and START REPLICATION UNTIL the following event id? We could even activate the... [06:43:02] (03CR) 10Jcrespo: "The bug was found and this patch reverted on gerrit:226559 Commenting here to have a reference." [puppet] - 10https://gerrit.wikimedia.org/r/226534 (owner: 10Muehlenhoff) [06:56:13] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:56:34] RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:56:54] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:57:04] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:44] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:57:54] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:58:14] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:06:23] (03CR) 10Giuseppe Lavagetto: [C: 031] Enable ferm for mc2* systems in codfw [puppet] - 10https://gerrit.wikimedia.org/r/226065 (owner: 10Muehlenhoff) [07:09:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "What ori said :)" [puppet] - 10https://gerrit.wikimedia.org/r/226506 (https://phabricator.wikimedia.org/T104972) (owner: 10Muehlenhoff) [07:10:15] (03CR) 10Giuseppe Lavagetto: [C: 031] Add ferm rules for HHVM admin site [puppet] - 10https://gerrit.wikimedia.org/r/226507 (owner: 10Muehlenhoff) [07:11:53] (03CR) 10Giuseppe Lavagetto: [C: 031] add ferm rules for redis [puppet] - 10https://gerrit.wikimedia.org/r/222554 (owner: 10Muehlenhoff) [07:21:43] (03PS5) 10Giuseppe Lavagetto: 
add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [07:22:53] (03CR) 10jenkins-bot: [V: 04-1] add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [07:36:15] (03PS6) 10Giuseppe Lavagetto: add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [07:36:58] (03CR) 10jenkins-bot: [V: 04-1] add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [07:45:20] (03PS3) 10Muehlenhoff: add ferm rules for redis [puppet] - 10https://gerrit.wikimedia.org/r/222554 [07:45:31] (03PS7) 10Giuseppe Lavagetto: add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [07:45:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] add ferm rules for redis [puppet] - 10https://gerrit.wikimedia.org/r/222554 (owner: 10Muehlenhoff) [07:57:01] (03PS3) 10Muehlenhoff: Add ferm rules for HHVM admin site [puppet] - 10https://gerrit.wikimedia.org/r/226507 [07:57:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for HHVM admin site [puppet] - 10https://gerrit.wikimedia.org/r/226507 (owner: 10Muehlenhoff) [07:58:16] (03PS8) 10Giuseppe Lavagetto: add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [08:00:10] (03CR) 10Giuseppe Lavagetto: [C: 032] add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [08:18:46] (03PS2) 10Muehlenhoff: Enable ferm for mc2* systems 
in codfw [puppet] - 10https://gerrit.wikimedia.org/r/226065 [08:19:12] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm for mc2* systems in codfw [puppet] - 10https://gerrit.wikimedia.org/r/226065 (owner: 10Muehlenhoff) [08:40:19] !log upgrading zuul to zuul_2.0.0-327-g3ebedde-wmf3precise1 to fix a regression ( https://phabricator.wikimedia.org/T106531 ) [08:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:46:29] 6operations, 10Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia - https://phabricator.wikimedia.org/T106499#1478406 (10hashar) [08:51:51] (03CR) 10Muehlenhoff: "I think you're right about the jobrunners and localhost. While it's listening for external connections, the dispatcher function only acces" [puppet] - 10https://gerrit.wikimedia.org/r/226506 (https://phabricator.wikimedia.org/T104972) (owner: 10Muehlenhoff) [08:59:35] <_joe_> hashar: I'm having big issues building etcd for trusty [08:59:58] _joe_: catching up in a few. I am in conf call with zeljkof [09:01:03] <_joe_> hashar: yeah don't worry :) [09:01:05] <_joe_> no rush [09:04:28] _joe_: ok back [09:04:34] was it solely to run the tests on the CI slaves ? [09:04:57] I am wondering whether we could get etcd installed via pip [09:05:21] since the test entry point runs tox which runs pip, you could get the etcd client version you are interested in pinned in requirements.txt [09:05:35] and the Jenkins job would download and install the proper etcd in the venv [09:05:40] saves you from having to build the package [09:06:03] that is until the CI Jessie slaves are ready :/ [09:06:05] <_joe_> no, etcd is a go software [09:06:17] <_joe_> not python-etcd [09:06:22] <_joe_> that's properly packaged [09:06:44] <_joe_> it's the server-side software we have to run [09:06:55] <_joe_> for integration tests [09:07:07] <_joe_> hashar: btw, do we have any jessie ci slave? 
[09:07:19] <_joe_> we can just run this on jessie
[09:07:59] (CR) Filippo Giunchedi: beta: Add script from Jenkins beta-update-databases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) (owner: Thcipriani)
[09:09:12] _joe_: ah my bad :-/
[09:09:41] _joe_: so the contint puppet manifests do not pass on Jessie because it includes mediawiki::packages and there is a bunch of font related packages that have been renamed between Ubuntu and Jessie
[09:09:48] (PS1) Giuseppe Lavagetto: confctl: fix warning on regexes [software/conftool] - https://gerrit.wikimedia.org/r/226682
[09:09:50] (PS1) Giuseppe Lavagetto: confctl: don't create inexistent entities [software/conftool] - https://gerrit.wikimedia.org/r/226683 (https://phabricator.wikimedia.org/T104574)
[09:09:57] we also have an Xvfb daemon which is wrapped with upstart. Need to migrate it to systemd
[09:10:01] and from there puppet will pass properly
[09:10:39] though I have set up a Jessie slave that does not bring the mediawiki manifests. That was to run the gdnsd linter on Jessie
[09:10:42] <_joe_> ok, I have no time for that now, we might want to skip integration tests for now
[09:10:43] we can do the same for etcd
[09:10:55] <_joe_> hashar: we can actually use the same slave?
[09:10:58] yeah
[09:11:10] in a big corporate world I would tell you no :D
[09:11:21] should we just install the etcd package ?
[09:11:32] <_joe_> yes
[09:12:32] * hashar struck by wikitech slowness
[09:12:35] <_joe_> there is a reason why I work here, besides the awesomeness of our project and being an ngo: we're not in a big corporate world
[09:12:38] <_joe_> :)
[09:12:45] yeah
[09:12:49] I love it myself
[09:13:00] drawback is we all end up with a ton of pressure on our shoulders hehe
[09:13:08] so the dumb Jessie jenkins slave is integration-lightslave-jessie-1002
[09:13:23] which has the puppet class role::ci::slave::labs::light
[09:13:59] that brings in the very minimum puppet classes to have the instance pooled as a slave
[09:14:00] and
[09:14:01] include authdns::lint
[09:14:11] so we can dish in the role the installation of the etcd package
[09:14:22] (PS2) Filippo Giunchedi: admin: add daniel to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/226055 (https://phabricator.wikimedia.org/T106047)
[09:14:29] (CR) Filippo Giunchedi: [C: 2 V: 2] admin: add daniel to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/226055 (https://phabricator.wikimedia.org/T106047) (owner: Filippo Giunchedi)
[09:14:45] <_joe_> hashar: I'll review your conftool patches now
[09:15:27] _joe_: should I just require_package 'etcd', 'etcdctl' in the role ?
[09:15:43] I am not sure I am willing to introduce an etcd::packages class :} [09:15:44] <_joe_> hashar: just 'etcd' [09:16:17] then we can use latest [09:16:21] to make sure it is up to date [09:16:30] I disabled unattended upgrade :/ [09:16:36] 10Ops-Access-Requests, 6operations, 6Analytics-Backlog, 5Patch-For-Review: Provide daniel (Daniel Kinzler) with Hive access - https://phabricator.wikimedia.org/T106047#1478457 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi merged, access should be available shortly [09:18:15] (03PS1) 10Hashar: contint: install etcd latest on Jessie light slaves [puppet] - 10https://gerrit.wikimedia.org/r/226685 [09:18:25] might have a bug for that [09:19:09] _joe_: here is etcd landing on the Jessie light Jenkins slave [09:19:44] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: install etcd latest on Jessie light slaves [puppet] - 10https://gerrit.wikimedia.org/r/226685 (owner: 10Hashar) [09:20:20] (03PS4) 10Filippo Giunchedi: access: New production ssh key for awight [puppet] - 10https://gerrit.wikimedia.org/r/226488 (owner: 10Matanya) [09:20:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] access: New production ssh key for awight [puppet] - 10https://gerrit.wikimedia.org/r/226488 (owner: 10Matanya) [09:21:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: New production ssh key for awight - https://phabricator.wikimedia.org/T106625#1478482 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi key added, thanks @matanya ! [09:25:00] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Setup tox for easy venv [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) (owner: 10Hashar) [09:25:02] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1478496 (10hashar) Per discussion with @Joe, CI will run the etcd Jenkins jobs on Jessie. Saves you from having to package it for Trusty solely for CI... 
[09:25:54] (03PS2) 10Filippo Giunchedi: access: shell account for Srijan Kumar [puppet] - 10https://gerrit.wikimedia.org/r/226491 (owner: 10Matanya) [09:26:32] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Fix flake8 issues [software/conftool] - 10https://gerrit.wikimedia.org/r/222291 (owner: 10Hashar) [09:27:44] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] tests: create proper loggers [software/conftool] - 10https://gerrit.wikimedia.org/r/222302 (owner: 10Hashar) [09:29:19] (03PS3) 10Filippo Giunchedi: access: shell account for Srijan Kumar [puppet] - 10https://gerrit.wikimedia.org/r/226491 (owner: 10Matanya) [09:29:29] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] access: shell account for Srijan Kumar [puppet] - 10https://gerrit.wikimedia.org/r/226491 (owner: 10Matanya) [09:29:57] (03PS3) 10Filippo Giunchedi: access: grant srijan access to stat1003 via research group [puppet] - 10https://gerrit.wikimedia.org/r/226492 (owner: 10Matanya) [09:31:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] access: grant srijan access to stat1003 via research group [puppet] - 10https://gerrit.wikimedia.org/r/226492 (owner: 10Matanya) [09:33:17] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1478520 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi merged, access should be granted shortly. Make sure to let us know when offboarding. [09:37:02] (03PS2) 10Filippo Giunchedi: access: add dcausse to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/226117 (owner: 10Matanya) [09:37:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] access: add dcausse to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/226117 (owner: 10Matanya) [09:38:51] _joe_: did you want python 2.7 / 3.4 or both ? 
[09:38:57] Ops-Access-Requests, operations, Patch-For-Review: Access request to stat1002 for dcausse - https://phabricator.wikimedia.org/T106370#1478527 (fgiunchedi) Open>Resolved a:fgiunchedi merged, reopen if access doesn't work
[09:38:58] currently there is only 2.7
[09:40:20] Ops-Access-Requests, operations: sudo request for Matanya to perform server side uploads - https://phabricator.wikimedia.org/T106447#1478530 (fgiunchedi)
[09:40:33] <_joe_> hashar_: 2.7 is ok
[09:40:44] https://gerrit.wikimedia.org/r/#/c/226688/1/zuul/layout.yaml,unified
[09:40:48] that migrates to Jessie
[09:41:12] special abusefilter page on en.wp seems broken
[09:41:22] Function: IndexPager::buildQueryInfo (AbuseFilterPager)
[09:41:22] Error: 2013 Lost connection to MySQL server during query (10.64.32.23)
[09:42:01] <_joe_> jynus, springle ^^
[09:42:17] <_joe_> thedj: repeatedly so or just once?
[09:42:19] (PS2) Filippo Giunchedi: access: grant Tilman Bayer access the Analytics data [puppet] - https://gerrit.wikimedia.org/r/226615 (https://phabricator.wikimedia.org/T105748) (owner: Matanya)
[09:42:21] repeatedly
[09:42:25] (CR) jenkins-bot: [V: -1] access: grant Tilman Bayer access the Analytics data [puppet] - https://gerrit.wikimedia.org/r/226615 (https://phabricator.wikimedia.org/T105748) (owner: Matanya)
[09:42:36] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Database_problem
[09:43:53] I've seen it fail a few times last night
[09:44:11] <_joe_> it seems it's not even specific to one host
[09:44:34] (PS3) Filippo Giunchedi: access: grant Tilman Bayer access the Analytics data [puppet] - https://gerrit.wikimedia.org/r/226615 (https://phabricator.wikimedia.org/T105748) (owner: Matanya)
[09:46:29] <_joe_> jynus, thedj I think the query is being too slow to complete within our predefined timeout
[09:46:40] _joe_, yes, that is the error
[09:46:46] so abusefilter stopped being scalable :)
[09:47:14] I can help pinpoint the issue,
but it is always the application! :-)
[09:47:28] <_joe_> jynus: it's always the application :P
[09:48:14] (CR) Filippo Giunchedi: [C: 2 V: 2] access: grant Tilman Bayer access the Analytics data [puppet] - https://gerrit.wikimedia.org/r/226615 (https://phabricator.wikimedia.org/T105748) (owner: Matanya)
[09:48:32] <_joe_> jynus: or, if you speak to a developer, it's always the database :P
[09:49:15] Ops-Access-Requests, operations, Reading-Admin, Patch-For-Review: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1478547 (fgiunchedi) Open>Resolved a:fgiunchedi merged, should be available shortly
[09:50:27] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint, Patch-For-Review: Upgrade production to elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106165#1478551 (fgiunchedi)
[09:50:50] Ops-Access-Requests, operations, Patch-For-Review: Access request to stat1002 for dcausse - https://phabricator.wikimedia.org/T106370#1478552 (dcausse) Thanks!
[09:52:36] <_joe_> why is puppet disabled on restbase1001?
[09:54:40] (CR) Giuseppe Lavagetto: [C: 2 V: 2] tests: catch KVObject.setup() SystemExit [software/conftool] - https://gerrit.wikimedia.org/r/222313 (owner: Hashar)
[09:54:46] hehe
[09:55:53] (CR) Filippo Giunchedi: [C: 1] "LGTM, I'll merge this on Mon" [puppet] - https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: Eevans)
[09:55:57] jynus: created https://phabricator.wikimedia.org/T106798
[09:56:11] thedj, thank you
[09:56:14] _joe_: I am making flake8 voting for conftool
[09:56:18] _joe_: was a reason provided?
[09:56:50] <_joe_> godog: yes, "gc logging, ask eevans"
[09:57:15] <_joe_> hashar: I just voted +2 on that btw
[09:57:23] yeah it is landing
[09:58:10] thank you also for the feedback on the page
[09:58:17] (enwiki, I mean)
[09:59:43] (CR) Hashar: "check experimental" [software/conftool] - https://gerrit.wikimedia.org/r/226682 (owner: Giuseppe Lavagetto)
[10:00:18] of course tox is not available on the Jessie light slave :D
[10:01:14] wow
[10:01:17] Range checked for each record (index map: 0x80)
[10:01:25] godog, _joe_: he logged "restarting Cassandra on restbase1001 to (temporarily) enable GC logging" in SAL yesterday
[10:01:38] <_joe_> moritzm: yes, saw that
[10:01:43] "It's a long time since I heard that name"
[10:02:19] (PS2) Giuseppe Lavagetto: confctl: fix warning on regexes [software/conftool] - https://gerrit.wikimedia.org/r/226682
[10:06:08] thedj: iirc that table recently had a new column added to it
[10:07:20] thedj: https://phabricator.wikimedia.org/rEABF77d161f65c2730c94807479caa05832801405956 added a join
[10:07:56] it refuses to use that index now
[10:08:33] because of the join? should we revert it?
[10:08:56] can we hint which index should be used?
[10:09:06] no, it is refused too
[10:09:28] I can try to create on one db an additional index
[10:09:34] and see if it works
[10:09:43] I think we should just revert it for now
[10:09:49] ok, for me
[10:10:14] (but I do not know the consequences, so I cannot say)
[10:10:32] do we have a diff of the index
[10:10:35] ?
[10:10:57] because that new index couldn't exist before the column
[10:11:21] um, no new index was added?
[10:11:36] let me comment on the ticket to clarify
[10:11:54] what is happening, then you can make an informed decision :-)
[10:12:06] ok :)
[10:14:45] <_joe_> hashar: need me to do anything else in order to make the tests run?
[10:15:57] (PS1) Hashar: contint: pin tox to version 1.9.2 [puppet] - https://gerrit.wikimedia.org/r/226694 (https://phabricator.wikimedia.org/T106799)
[10:17:10] legoktm, T106798#1478602
[10:17:17] _joe_: yeah I need to install pip / tox. Working on it
[10:17:43] I would agree with a revert, but I would assume it could be fixed with further investigation
[10:18:20] _joe_: yak shaving a few other things first
[10:18:52] jynus: since this was just a small feature request, I'm going to revert now, and we can re-work the implementation later
[10:19:06] ok with me
[10:19:29] it is consistently failing, so I think it is wiser
[10:20:45] also, it is not a "table has grown and now it does not scale", it makes the query 100x slower
[10:21:01] (CR) Hashar: "Breaks puppet with:" [puppet] - https://gerrit.wikimedia.org/r/226237 (https://phabricator.wikimedia.org/T62720) (owner: Niedzielski)
[10:22:47] I have subscribed to the original feature request and will try to help
[10:23:08] !log legoktm Synchronized php-1.26wmf15/extensions/AbuseFilter/: Special:AbuseFilter on all large Wikipedias is returning errors - T106798 (duration: 00m 13s)
[10:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:23:36] https://en.wikipedia.org/wiki/Special:AbuseFilter loads now
[10:23:57] (PS1) Hashar: contint: fully qualify unix command [puppet] - https://gerrit.wikimedia.org/r/226696 (https://phabricator.wikimedia.org/T62720)
[10:24:09] (CR) Hashar: "Fixed with https://gerrit.wikimedia.org/r/226696" [puppet] - https://gerrit.wikimedia.org/r/226237 (https://phabricator.wikimedia.org/T62720) (owner: Niedzielski)
[10:24:52] _joe_, what I meant with my original reaction is that the error shown to users usually says "Database error", and I am usually 100% blamed
[10:25:21] this is no one's fault, and usually DB errors mean ops and devs have to work together :-)
[10:25:27] maybe we can change the message to something like:
MediaWiki encountered an internal error while attempting to reach the data store [10.X.X.X] [10:25:42] do not blame mediawiki either :-) [10:25:50] (03CR) 10Hashar: [C: 04-1] "Error: Failed to apply catalog: Could not find dependency User[jenkins-deploy] for Exec[jenkins-deploy kvm membership] at /etc/puppet/modu" [puppet] - 10https://gerrit.wikimedia.org/r/226696 (https://phabricator.wikimedia.org/T62720) (owner: 10Hashar) [10:26:09] ah puppet [10:26:11] well, the current db error is much better than "MWException [hash]" :) [10:26:13] something neutral like "file an error with this description" [10:26:35] thedj: fixed btw ^^ [10:26:55] legoktm, let me check it on the error logs [10:27:14] ok, I loaded Special:AbuseFilter on a few large projects and it worked [10:29:13] * Nemo_bis randomly blames the trivial filters for style issues [10:30:10] implement jshint/jscs for gadgets via AbuseFilter? ;) [10:32:05] legoktm: someone talked about having linters/validator to be run when a page is saved and report back to user on failure [10:32:39] (03PS1) 10Hashar: contint: drop require to a user in LDAP [puppet] - 10https://gerrit.wikimedia.org/r/226698 (https://phabricator.wikimedia.org/T62720) [10:33:25] (03CR) 10Hashar: "Dropped the User['jenkins-deploy'] requirement with https://gerrit.wikimedia.org/r/#/c/226698/" [puppet] - 10https://gerrit.wikimedia.org/r/226696 (https://phabricator.wikimedia.org/T62720) (owner: 10Hashar) [10:39:05] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/226698 (https://phabricator.wikimedia.org/T62720) (owner: 10Hashar) [10:39:13] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppet master." 
[puppet] - 10https://gerrit.wikimedia.org/r/226696 (https://phabricator.wikimedia.org/T62720) (owner: 10Hashar) [10:45:32] (03PS2) 10Hashar: contint: pin tox to version 1.9.2 [puppet] - 10https://gerrit.wikimedia.org/r/226694 (https://phabricator.wikimedia.org/T106799) [10:46:16] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppet master and confirmed to work:" [puppet] - 10https://gerrit.wikimedia.org/r/226694 (https://phabricator.wikimedia.org/T106799) (owner: 10Hashar) [10:49:15] (03PS1) 10Giuseppe Lavagetto: conftool: fix integration tests [software/conftool] - 10https://gerrit.wikimedia.org/r/226701 [10:49:43] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conftool: fix integration tests [software/conftool] - 10https://gerrit.wikimedia.org/r/226701 (owner: 10Giuseppe Lavagetto) [10:49:48] (03CR) 10Hashar: "The pip puppet provider was introduced by https://gerrit.wikimedia.org/r/#/c/111536/" [puppet] - 10https://gerrit.wikimedia.org/r/226694 (https://phabricator.wikimedia.org/T106799) (owner: 10Hashar) [10:50:58] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: pin tox to version 1.9.2 [puppet] - 10https://gerrit.wikimedia.org/r/226694 (https://phabricator.wikimedia.org/T106799) (owner: 10Hashar) [10:51:08] what a mess [10:51:54] <_joe_> what is a mess? 
 [10:52:27] (03PS2) 10Giuseppe Lavagetto: contint: fully qualify unix command [puppet] - 10https://gerrit.wikimedia.org/r/226696 (https://phabricator.wikimedia.org/T62720) (owner: 10Hashar) [10:53:04] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint: fully qualify unix command [puppet] - 10https://gerrit.wikimedia.org/r/226696 (https://phabricator.wikimedia.org/T62720) (owner: 10Hashar) [10:53:53] since the mediawiki::packages is not ready for Jessie, I can't include contint::packages or contint::packages::labs [10:54:00] (03PS2) 10Giuseppe Lavagetto: contint: drop require to a user in LDAP [puppet] - 10https://gerrit.wikimedia.org/r/226698 (https://phabricator.wikimedia.org/T62720) (owner: 10Hashar) [10:54:01] the latter is the one providing tox :-D [10:54:15] so I would have to create yet another class such as contint::packages::labs::tox [10:54:23] <_joe_> nah [10:54:32] <_joe_> lemme look into it [10:54:43] or copy paste the tox installation in the role::ci::lightslave role [10:55:04] <_joe_> which class sets up tox?
[10:55:16] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: drop require to a user in LDAP [puppet] - 10https://gerrit.wikimedia.org/r/226698 (https://phabricator.wikimedia.org/T62720) (owner: 10Hashar) [10:59:18] _joe_: contint::packages::labs [10:59:24] you can grep for "'tox'" [11:01:36] PROBLEM - puppet last run on db2056 is CRITICAL puppet fail [11:09:35] RECOVERY - puppet last run on db2056 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:13:56] 6operations, 10RESTBase-Cassandra: provide restbase with systemd unit files - https://phabricator.wikimedia.org/T106806#1478728 (10fgiunchedi) 3NEW [11:29:26] I could use a review of https://gerrit.wikimedia.org/r/#/c/226087/ [11:33:27] Glaisher, of course I will help with both bugs you recently wrote about [11:33:46] Thanks :) [11:34:07] it is just that if they are not repeatable, they will not have priority over other tasks [11:34:14] (unlike the recent one) [11:36:40] (03PS1) 10Muehlenhoff: Add backports of recent Linux kernel NMI security fixes [debs/linux] - 10https://gerrit.wikimedia.org/r/226703 [11:37:00] there are sadly ton of small impact errors, and I can only help with a few at a time while maintaining the db servers at the same time [11:37:40] we could, however, improve the workflow for query optimization, although not sure how [11:58:02] 6operations: Simplify hiera lookup model - https://phabricator.wikimedia.org/T106404#1478828 (10fgiunchedi) +1 to the new scheme >>! In T106404#1468756, @chasemp wrote: > Does anyone think the main weakness it exposes re: mw2148.yaml, mw2149.yaml, mw2150.yaml, and mw2151.yaml would be better resolved by a bette... [12:14:18] 6operations: Simplify hiera lookup model - https://phabricator.wikimedia.org/T106404#1478875 (10faidon) > The disadvantage of this approach is that it is not always possible to select the nodes of some logical group using a common string prefix. For example, codfw image scalers are mw2148 - mw2151. In this case,... 
[12:18:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add backports of recent Linux kernel NMI security fixes [debs/linux] - 10https://gerrit.wikimedia.org/r/226703 (owner: 10Muehlenhoff) [12:46:03] 10Ops-Access-Requests, 6operations: Requesting access for joal to resources [stat1001, stat1002, stat1003, bast1001.wikimedia.org, Hadoop cluster, eventlogging1001, hafnium] - New key after laptop stolen - https://phabricator.wikimedia.org/T106812#1478952 (10JAllemandou) 3NEW a:3Ottomata [12:47:06] !log Jenkins: switching gearman plugin from our custom compiled 0.1.1-9-g08e9c42-change_192429_2 to upstream 0.1.2. They are actually the exact same versions. [12:47:10] !log restarting Jenkins [12:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:46] (03PS2) 10Muehlenhoff: Add ferm rules for MX mail servers [puppet] - 10https://gerrit.wikimedia.org/r/226078 (https://phabricator.wikimedia.org/T104979) [12:47:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for MX mail servers [puppet] - 10https://gerrit.wikimedia.org/r/226078 (https://phabricator.wikimedia.org/T104979) (owner: 10Muehlenhoff) [12:52:19] (03PS1) 10Muehlenhoff: Enable ferm on lead [puppet] - 10https://gerrit.wikimedia.org/r/226707 (https://phabricator.wikimedia.org/T104979) [12:53:36] (03PS1) 10Muehlenhoff: Enable ferm for polonium [puppet] - 10https://gerrit.wikimedia.org/r/226708 (https://phabricator.wikimedia.org/T104979) [12:55:02] (03PS7) 10coren: Labs: Script to back labstore filesystems up [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) [13:05:25] 6operations: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1478979 (10Aklapper) @Deskana: Any idea who could decide on this request or in whose court that could be? 
:-/ [13:11:12] !log swapping ssds in restbase1007 [13:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:10] 6operations: Requesting access for joal to resources [stat1001, stat1002, stat1003, bast1001.wikimedia.org, Hadoop cluster, eventlogging1001, hafnium] - New key after laptop stolen - https://phabricator.wikimedia.org/T106812#1478997 (10JohnLewis) p:5Triage>3Normal [13:18:00] 6operations: Requesting access for joal to resources [stat1001, stat1002, stat1003, bast1001.wikimedia.org, Hadoop cluster, eventlogging1001, hafnium] - New key after laptop stolen - https://phabricator.wikimedia.org/T106812#1479002 (10JohnLewis) Looking at it, this is just a new ssh key request after your old o... [13:22:49] (03PS1) 10John F. Lewis: admin: add new ssh key for joal [puppet] - 10https://gerrit.wikimedia.org/r/226710 (https://phabricator.wikimedia.org/T106812) [13:24:11] godog: ^^ [13:24:23] 7Puppet: Puppet Trebuchet provider compares refname with commit sha1 and does NOT refresh the git repo! - https://phabricator.wikimedia.org/T77002#1479017 (10hashar) 5Open>3Resolved Ran puppet with --debug on deployment-jobrunner01.deployment-prep.eqiad.wmflabs, I have inserted the sha1 reported by the vario... [13:25:01] joal: ^^ new ssh-key patch. +1 if you feel like it though the phab comment was enough :) [13:27:16] JohnFLewis: nifty! thanks [13:27:50] JohnFLewis: how we know you are you !! :D [13:28:15] hashar: ask me a mailman question ;) [13:28:22] ahah [13:28:50] JohnFLewis: by the way lanthanum is being phased out. Depends on disk being wiped which as i understand it need physical presence in the datacenter [13:29:19] hashar: which is where cmjohnson1 comes in :D [13:30:00] yep, it a low priority item but i have plans of wiping today [13:30:22] (03CR) 10Joal: [C: 031] "Thank !" [puppet] - 10https://gerrit.wikimedia.org/r/226710 (https://phabricator.wikimedia.org/T106812) (owner: 10John F. 
Lewis) [13:30:27] Thanks JohnFLewis , LGTM :) [13:30:30] cmjohnson1: yeah I guess it all depends on when JohnFLewis needs the machine [13:30:54] i will start the wipe today ...should be fine for Monday [13:31:01] JohnFLewis: it has a couple 128GB SSD iirc, and served CI very well. It is a good machine. [13:31:02] hashar: me? the closest use I have for it is a sodium replacement but that's a VM :( [13:31:46] I don't get to decide things with machines or use them because not being staff, you don't get many volunteer inspired ideas which makes robh assign a machine for you [13:33:06] (03PS1) 10Giuseppe Lavagetto: contint::packages: factor c::p::python out [puppet] - 10https://gerrit.wikimedia.org/r/226711 [13:33:08] (03PS1) 10Giuseppe Lavagetto: contint::packages: factor c::p::apt out [puppet] - 10https://gerrit.wikimedia.org/r/226712 [13:33:22] <_joe_> hashar: ^^ tell me what you think of this [13:33:34] <_joe_> I divided the packages in subunits [13:33:41] <_joe_> well, for now just python [13:33:51] <_joe_> but we should probably do the same with ruby [13:36:47] (03CR) 10Hashar: [C: 031] "Ah yeah that is way nicer than the crazy hack I talked about on IRC. Will certainly please Faidon in the long term since he regularly com" [puppet] - 10https://gerrit.wikimedia.org/r/226711 (owner: 10Giuseppe Lavagetto) [13:36:52] _joe_: yeah that is way better [13:37:12] the packages.pp and packages/labs.pp have grown up organically [13:37:31] the python one looks all fine to me [13:37:36] 6operations: Update wikimedia apt repo to include debs for shiny-server - https://phabricator.wikimedia.org/T106435#1479046 (10fgiunchedi) p:5Triage>3Low is the package already in use? Some of the maintainer scripts could be simplified (https://github.com/rstudio/shiny-server/tree/master/packaging/debian-con... [13:37:38] want me to cherry pick it and double check it works ? 
[13:37:47] <_joe_> hashar: that would be sweet [13:38:00] <_joe_> beware, it's gonna install python3 on precises as well [13:39:14] oh [13:39:23] _joe_: I havent caught that [13:40:04] _joe_: Precise only has python3.2 [13:40:22] that is why it ended up in the os_version('ubuntu >= trusty') harness [13:40:32] <_joe_> yes, so I'm using the virtual packages, python3 and python3-dev [13:40:54] <_joe_> it doesn't do any harm, and allows to use the correct package name in the future [13:41:14] and Trusty python3 is python3.4 so it is fine [13:41:22] though I prefer explicit versions usually. It is not a big deal [13:41:33] that is one less os_version() condition floating around [13:43:02] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: add new ssh key for joal [puppet] - 10https://gerrit.wikimedia.org/r/226710 (https://phabricator.wikimedia.org/T106812) (owner: 10John F. Lewis) [13:44:01] joal: you may now resume usual activities shortly. :) [13:44:46] !log swapping failed disk db1058 [13:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:45:17] Thanks a lot JohnFLewis for the seepdy merge :) [13:45:36] thank godog for that, I wish I could speedily merge things ;) [13:46:05] Thanks godog as well ! [13:46:08] :) [13:51:30] _joe_: cherry picking both on CI puppetmaster . Will run puppet agent on precise/trusty [13:51:54] <_joe_> ok :) [13:52:02] <_joe_> hope I didn't screw something up [13:53:02] damn [13:53:15] remembers me I need to migrate labs puppetmaster::self to use mysql instead of sqlite [13:53:20] too many locks :/ [13:54:17] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1479073 (10Cmjohnson) The new Samsung Pros have been added to restbase1007 and restbase1009. restbase1008 has the intels. 
[13:54:37] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on puppetmaster and it works :-}" [puppet] - 10https://gerrit.wikimedia.org/r/226712 (owner: 10Giuseppe Lavagetto) [13:54:45] (03CR) 10Hashar: "Cherry picked on puppetmaster and it works :-}" [puppet] - 10https://gerrit.wikimedia.org/r/226711 (owner: 10Giuseppe Lavagetto) [13:54:52] _joe_: good to be merged. Thx! [13:55:18] (03CR) 10Hashar: "check experimental" [software/conftool] - 10https://gerrit.wikimedia.org/r/226682 (owner: 10Giuseppe Lavagetto) [13:57:52] hehe no problem joal, thanks JohnFLewis for your help! [13:58:26] (03PS1) 10Hashar: contint: apt conf and python packages for light slaves [puppet] - 10https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) [13:58:28] godog: login still refused on stat1002, normal ? [14:00:37] joal: it can take up to 20-30m, I see puppet has just ran though [14:00:44] :( [14:00:49] IOW, "try again" :) [14:01:19] godog: nope :( [14:01:33] godog: Agent admitted failure to sign using the key. [14:02:13] to bast1001 [14:02:41] (03CR) 10Hashar: [C: 031] "Cherry picked on puppet master. Now integration-lightslave-jessie-1002 has pip and tox :-}" [puppet] - 10https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) (owner: 10Hashar) [14:02:56] (03CR) 10Hashar: "check experimental" [software/conftool] - 10https://gerrit.wikimedia.org/r/226682 (owner: 10Giuseppe Lavagetto) [14:03:19] joal: that sounds like an issue with your ssh agent rather than the server [14:03:31] doh [14:03:38] I get that message when I accidentally click 'deny' when my ssh agent pops up 'do you want to use this key?' 
[14:03:42] _joe_: turns out we need a gcc compiler for python packages :-D https://integration.wikimedia.org/ci/job/tox-py27-jessie/3/console [14:03:53] hm, thx valhallasw`cloud, will try logout/log bak in [14:04:30] (03CR) 10Hashar: [C: 04-1] "Turns out we need gcc as well :(" [puppet] - 10https://gerrit.wikimedia.org/r/226711 (owner: 10Giuseppe Lavagetto) [14:04:32] <_joe_> hashar: ah, right [14:04:39] <_joe_> ok, adding it is not a problem [14:04:50] <_joe_> I'd say we need build-essential :) [14:04:53] ahah [14:05:16] valhallasw`cloud, godog --> WORKS ! [14:05:27] Thanks a lot [14:05:48] you're welcome :-) [14:06:02] _joe_: want me to amend? [14:06:39] <_joe_> hashar: I'm on it [14:07:17] joal: yw [14:10:14] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [14:10:14] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [14:10:14] PROBLEM - check_mysql on payments1003 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [14:12:08] (03PS2) 10Giuseppe Lavagetto: contint::packages: factor c::p::python out [puppet] - 10https://gerrit.wikimedia.org/r/226711 [14:12:10] (03PS2) 10Giuseppe Lavagetto: contint::packages: factor c::p::apt out [puppet] - 10https://gerrit.wikimedia.org/r/226712 [14:12:29] <_joe_> hashar: I added 'build-essential' to the list of builded packages [14:15:15] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [14:15:15] RECOVERY - check_mysql on payments1004 is OK: Uptime: 72437 Threads: 2 Questions: 61860 Slow queries: 83 Opens: 554 Flush tables: 3 Open tables: 61 Queries per second avg: 0.853 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [14:20:14] RECOVERY - check_mysql on payments2001 is OK: Uptime: 70986 Threads: 3 Questions: 19709 Slow queries: 11 Opens: 277 Flush tables: 1 Open tables: 45 Queries per second avg: 0.277 
Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [14:20:15] RECOVERY - check_mysql on payments1003 is OK: Uptime: 294 Threads: 1 Questions: 1093 Slow queries: 19 Opens: 234 Flush tables: 1 Open tables: 64 Queries per second avg: 3.717 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [14:21:58] (03CR) 10Ottomata: "Not a bad idea!" [puppet] - 10https://gerrit.wikimedia.org/r/226646 (owner: 10Ori.livneh) [14:22:03] _joe_: cherry picking / running puppet / trigger jenkins job :-} [14:22:34] (03PS1) 10Muehlenhoff: Use fixed ports for dataset NFS server [puppet] - 10https://gerrit.wikimedia.org/r/226717 (https://phabricator.wikimedia.org/T105040) [14:22:36] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 1499 MB (3% inode=57%) [14:22:39] (03PS2) 10Hashar: contint: apt conf and python packages for light slaves [puppet] - 10https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) [14:24:36] RECOVERY - Disk space on copper is OK: DISK OK [14:28:36] _joe_: you can merge the patches [14:28:43] <_joe_> hashar: ok [14:28:54] (03PS3) 10Giuseppe Lavagetto: contint::packages: factor c::p::python out [puppet] - 10https://gerrit.wikimedia.org/r/226711 [14:28:55] fails compiling because of a missing header ( openssl/aes.h ) but will do that as a follow up change [14:29:22] 10Ops-Access-Requests, 6operations, 10Graphoid: Allow mobrovac to restart Graphoid - https://phabricator.wikimedia.org/T106814#1479133 (10mobrovac) 3NEW [14:29:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint::packages: factor c::p::python out [puppet] - 10https://gerrit.wikimedia.org/r/226711 (owner: 10Giuseppe Lavagetto) [14:31:13] (03PS3) 10Giuseppe Lavagetto: contint::packages: factor c::p::apt out [puppet] - 10https://gerrit.wikimedia.org/r/226712 [14:31:24] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint::packages: factor c::p::apt out [puppet] - 10https://gerrit.wikimedia.org/r/226712 (owner: 10Giuseppe Lavagetto) [14:31:47] _joe_: so 
to compile python-cryptography , we need libssl-dev [14:31:53] not sure where to stick that package though [14:32:02] I have installed it manually and the job pass now https://integration.wikimedia.org/ci/job/tox-py27-jessie/6/console [14:32:34] <_joe_> hashar: I can look into this as well [14:33:02] <_joe_> libssl-dev is not in build-essential, meh [14:33:26] another way would be to install python-cryptography [14:33:38] and have virtualenv reuse system packages when they fit the requirement [14:33:43] that would save the compilation time [14:35:03] but I dont think pip supports that [14:35:45] build-essential is only gcc/g++/make/dpkg-dev, all other libs than glibc need to be added manually [14:37:07] having a break, I am wasted / need water etc [14:39:25] hashar: if you use --system-site-packages, system packages will be used if they are new enough [14:39:38] but if a newer one is required, pip will install it [14:39:39] iirc [14:41:33] 10Ops-Access-Requests, 6operations: sudo request for Matanya to perform server side uploads - https://phabricator.wikimedia.org/T106447#1479174 (10Krenair) I'm not sure this would need any extra sudo access beyond what the restricted group already grants for mwscript... [14:42:04] PROBLEM - puppet last run on mw2137 is CRITICAL Puppet has 1 failures [14:43:44] <_joe_> hashar: what valhallasw`cloud said :) [14:44:57] valhallasw`cloud: yeah that would work [14:45:08] but if system packages provide some dependencies that developers failed to mention in requirements.txt [14:45:13] we will not caught them :/ [14:45:29] <_joe_> hashar: not that a big deal. [14:45:32] I am looking for a way for pip to install requirements from the system if available [14:50:12] hashar: ah! mm, not sure if that's possible, but we could have a wheel cache to reduce compilation time [14:50:14] PROBLEM - check_puppetrun on indium is CRITICAL puppet fail [14:50:38] valhallasw`cloud: I am wiling to learn! 
:D [14:50:56] valhallasw`cloud: I noticed this week we keep recompiling python yaml over and over [14:51:15] it's in pip 7+, but I haven't played with it much [14:51:16] "Wheel cache - Pip will read from the subdirectory wheels within the pip cache dir and use any packages found " [14:51:32] ah, and you can pass --cache-dir=... [14:52:10] we have a .pip/pip.conf with solely: [14:52:15] [install] [14:52:15] download-cache = ~/cache/pip [14:53:32] that should be nonfunctional in pip 7 [14:53:54] pip by default uses ~/.cache/pip [14:54:08] but that's per-user which might be an issue? not sure if different users are used? [14:54:19] so just upgrading to pip 7 might solve all your problems! :-) [14:54:48] we have only one user [14:55:08] so I guess I can get rid of the pip.conf [14:55:14] and upgrade pip to 7.x [14:55:15] RECOVERY - check_puppetrun on indium is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:55:18] and see what happens [14:55:59] you can also pre-build packages with --wheel-dir and then use those wheels with --find-links [14:56:04] https://pip.pypa.io/en/latest/user_guide.html#installing-from-wheels [14:56:13] (but also there: never actually tried it! 
:D) [14:56:32] oh on the Jessie slave there is a .cache/pip/wheels \O/ [14:57:28] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1479187 (10mobrovac) [15:03:43] (03PS1) 10Cmjohnson: DO NOT MERGE YET: DNS changes for logstash1001 and logstash1003 [dns] - 10https://gerrit.wikimedia.org/r/226722 [15:03:58] (03CR) 10jenkins-bot: [V: 04-1] DO NOT MERGE YET: DNS changes for logstash1001 and logstash1003 [dns] - 10https://gerrit.wikimedia.org/r/226722 (owner: 10Cmjohnson) [15:05:26] (03PS2) 10Cmjohnson: DO NOT MERGE YET: DNS changes for logstash1001 and logstash1003 [dns] - 10https://gerrit.wikimedia.org/r/226722 [15:08:36] RECOVERY - puppet last run on mw2137 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:10:22] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1479227 (10Cmjohnson) The racks are prepped for the move. Switches have been updated DNS patch is ready to merge https://gerrit.wikimedia.org/r/#/c/226722/ Just need to do the physical removal of disks... [15:11:05] 6operations, 10ops-eqiad: wipe disks for lanthanum - https://phabricator.wikimedia.org/T105901#1479229 (10Cmjohnson) Disks are being wiped, removing the ssd from the server before adding back to the spare list [15:12:31] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1479238 (10jcrespo) So, summarizing @robh @springle: * 3 nodes per cluster, (ideally 4) x 3 clusters x 2 datacenters = 18 - 24 nodes. 2 out of the 3 clusters per datacenter are needed wit... [15:12:45] RECOVERY - RAID on db1058 is OK optimal, 1 logical, 2 physical [15:13:42] ^the week ends much better than how it started :-) [15:14:09] thanks, cmjohnson1 ! 
[15:14:26] yw [15:14:52] you know how much I love you and papaul :-) [15:15:01] 6operations, 10ops-eqiad: db1058 (s5 master) degraded RAID - https://phabricator.wikimedia.org/T105627#1479244 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Replaced disk...all is good with the world [15:16:35] jynus : thank you [15:17:41] jynus: we're here for you! [15:17:49] :-P [15:18:27] (I am just appealing to your emotions because I will ask you some hard work soon :-P) [15:21:42] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T106654#1479267 (10Cmjohnson) Order a new disk from Dell. It will be here Monday  [15:25:53] (03PS1) 10Hashar: contint: install pip 7.1.0 from pypi [puppet] - 10https://gerrit.wikimedia.org/r/226729 [15:25:55] (03PS1) 10Hashar: contint: drop pip obsolete download-cache option [puppet] - 10https://gerrit.wikimedia.org/r/226730 [15:27:29] (03CR) 10Merlijn van Deen: [C: 031] contint: install pip 7.1.0 from pypi [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [15:27:49] (03CR) 10Merlijn van Deen: [C: 031] contint: drop pip obsolete download-cache option [puppet] - 10https://gerrit.wikimedia.org/r/226730 (owner: 10Hashar) [15:29:50] 6operations, 7Documentation: Update wiki documentation related to RT - https://phabricator.wikimedia.org/T76990#1479285 (10chasemp) 5Open>3Resolved a:3chasemp >>! In T76990#1475014, @Aklapper wrote: > #operations: Anything specifically left to do here or can this be closed? seems not [15:32:49] 10Ops-Access-Requests, 6operations: sudo request for Matanya to perform server side uploads - https://phabricator.wikimedia.org/T106447#1479298 (10hoo) >>! In T106447#1479174, @Krenair wrote: > I'm not sure this would need any extra sudo access beyond what the restricted group already grants for mwscript... I... 
[15:33:19] 10Ops-Access-Requests, 6operations: Add Matanya to "restricted" to perform server side uploads - https://phabricator.wikimedia.org/T106447#1479299 (10hoo) [15:34:06] (03CR) 10Hashar: "On a Jessie slave, pip now caches download/compiled modules as wheels:" [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [15:37:26] (03PS1) 10Chmarkine: Change protocol relative to https [puppet] - 10https://gerrit.wikimedia.org/r/226731 [15:42:22] (03CR) 10Hashar: "Cherry picked on puppetmaster and upgrade them all:" [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [15:42:34] (03CR) 10Hashar: [C: 031 V: 031] contint: install pip 7.1.0 from pypi [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [15:45:26] _joe_: so as a side effect of this afternoon rush, we have a much better caching strategy for tox / pip jobs (thank you valhallasw`cloud ) [15:46:43] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on puppetmaster. I have confirmed pip 7.1.0 (installed with parent change), does indeed handles cache automatically. 
It eve" [puppet] - 10https://gerrit.wikimedia.org/r/226730 (owner: 10Hashar) [15:55:32] _joe_: and going to make the tests voting now https://gerrit.wikimedia.org/r/#/c/226733/ :} [15:55:42] <_joe_> hashar: thanks [15:55:57] _joe_: thank you for all the improveements [15:56:07] and I am quite happy to see the python compiled modules to now be cached [15:58:52] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/226682 (owner: 10Giuseppe Lavagetto) [16:00:13] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/226683 (https://phabricator.wikimedia.org/T104574) (owner: 10Giuseppe Lavagetto) [16:00:14] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:01:48] jouncebot: next [16:01:48] In 70 hour(s) and 58 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150727T1500) [16:04:55] lol [16:05:14] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:05:55] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1479362 (10mark) So this is just to -test- (temporarily) bare metal hardware in Labs, and will be returned to the spare pool? Anything works, but I'd default to the weakest s... [16:06:39] jgage: Got a minute to help me with scheduling downtime for the ElasticSearch health check in icinga for the logstash_eqiad group? [16:06:58] I tired to self-serve but got "Not Authorized" from icinga [16:07:21] bd808: why is that happening on fridays btw? 
[16:07:25] perhaps i missed earlier discussions :) [16:07:34] (03PS8) 10coren: Labs: Script to back labstore filesystems up [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) [16:08:02] mark: Least disruptive to other deploys that may be wanting to watch logstash [16:08:17] but if that's not good in your opinion I can do it some other time [16:08:35] the discussion was mostly in my own head [16:08:36] well there is a reason we don't generally deploy on fridays of course ;) [16:10:14] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:10:29] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1479392 (10yuvipanda) I have someone in mind (halfak / restoring) who will be requesting this once we have this stable, and hence was requesting hardware that might be suitab... [16:10:38] (03CR) 10coren: [C: 032] " Coren: sure! Consider this a virtual +1? :)" [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) (owner: 10coren) [16:10:39] The last update broke kibana for the whole 12 hour upgrade window. On a M-Th that would have been extremely disruptive to swat and the train [16:11:14] and I'm not excited by the idea of getting up in the middle of the night to miss the SF day time window [16:12:00] But like I said if you want me to not do it I certainly will stop [16:12:09] i didn't get concerns from the team or anything [16:12:18] how likely is it that this goes wrong and needs support from us this weekend? :) [16:12:50] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1479400 (10yuvipanda) Dell PowerEdge R310, Single Intel Xeon X3450, 8GB Memory (2) 500GB 3.5 SATA seems to be the 'weakest' I could find on a quick glance [16:12:54] pretty low. 
I have sudo on all the boxes so I can fix anything other than network/hardware [16:13:15] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479401 (10TJones) @Krenair, assuming stat1002.eqiad.wmnet is the right host, this is what I get: ``` ssh -v stat1002.eqiad.wmnet OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011 debug... [16:13:50] We will know in the first hour or so if there is something horribly wrong with the new elasticsearch version [16:13:56] alright then [16:14:14] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1479402 (10RobH) [16:14:16] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479404 (10Krenair) Your ssh config shouldn't contain references to iron.wikimedia.org. Only ops can log in there. [16:14:31] thanks [16:14:32] bd808: good morning and my best wishes for the upgrade! [16:14:36] rob will schedule icinga downtime for you [16:14:45] bd808: yep, pulling it up now [16:14:52] thanks mark and robh [16:15:14] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:15:25] and we really should talk sometime about "ownership" of logstash again. [16:15:30] 1008 i think is not ours [16:15:35] yes [16:15:35] bd808: which systems specifically, or all? [16:16:07] robh: all 6 of them. It will be a rolling cluster upgrade. I hope it will be done by 23:00Z today [16:16:35] actually I hope it will be done long before that but ... lets not tempt fait with a short window [16:17:05] so to confirm, I'm going to put logstash1001-1006 into scheduled downtime for ALL services [16:17:18] robh: Just "ElasticSearch health check for shards" [16:17:55] ahh, glad i checked, standby [16:19:38] well, it says its done [16:19:55] In Scheduled Downtime? 
[16:19:55] YES [16:20:05] bd808: Ok, you have scheduled downtime on those services until 2300 gmt [16:20:14] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:20:15] perfect. thanks robh [16:20:18] quite welcome [16:20:37] though too bad i didnt have my pagerduty test in place already, we could have left one out and seen if my stuff worked ;D [16:20:39] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1479433 (10mark) I approve using any hardware we have for the purpose of getting bare metal hw in Labs working, provided it's not in the way of other hw allocations. [16:21:07] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1479434 (10RobH) a:5mark>3RobH [16:21:58] db1035? [16:25:14] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:25:57] !log Upgraded logstash1001 to elasticsearch 1.7.0 [16:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:44] !log Upgraded logstash1002 to elasticsearch 1.7.0 [16:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:31] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1479445 (10RobH) [16:27:31] !log Upgraded logstash1003 to elasticsearch 1.7.0 [16:27:36] easy ones done [16:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:19] (03PS1) 10Yuvipanda: labstore: Move ssh hiera settings into proper place [puppet] - 10https://gerrit.wikimedia.org/r/226735 [16:28:46] Coren: ^ [16:29:43] (03CR) 10Hashar: [C: 04-1] "Puppet fails randomly:" [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [16:30:14] PROBLEM - 
check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:30:28] nice! Looks like the only shard that will rebuild completely is todays [16:31:00] * bd808 crosses fingers and knocks wood [16:31:31] (03CR) 10Yuvipanda: [C: 032] labstore: Move ssh hiera settings into proper place [puppet] - 10https://gerrit.wikimedia.org/r/226735 (owner: 10Yuvipanda) [16:31:43] * greg-g throws some salt around [16:35:14] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:35:26] (03CR) 10Hashar: "The intermittent error is a PATH issue. I have /usr/local/bin in my env, whereas puppet probably doesn't :(" [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [16:35:57] !log Upgraded logstash1004 to elasticsearch 1.7.0 [16:36:01] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1479599 (10RobH) [16:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:59] (03CR) 10Ori.livneh: [C: 04-1] "Actually, this should be a relative URL. It'll work just the same, and won't require resolving meta.wikimedia.org." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [16:40:15] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:45:14] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [16:46:31] (03PS1) 10Ori.livneh: Follow-up for Ie17cb06: add thumb_handler.php ProxyPass rule to all vhosts [puppet] - 10https://gerrit.wikimedia.org/r/226738 (https://phabricator.wikimedia.org/T84842) [16:48:00] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479665 (10TJones) @Krenair, thanks for the help. Everyone says "stat1002" but the only full host name I've found, in the office wiki, refers to stat1002.eqiad.wmnet. Is that th... [16:48:07] !log Upgraded logstash1005 to elasticsearch 1.7.0 [16:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:35] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I'm not sure if we can use the same docroot inside all the vhosts, or if we need to differenciate as for all the other clusters." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226738 (https://phabricator.wikimedia.org/T84842) (owner: 10Ori.livneh) [16:48:37] (03CR) 10Ori.livneh: [C: 032] Follow-up for Ie17cb06: add thumb_handler.php ProxyPass rule to all vhosts [puppet] - 10https://gerrit.wikimedia.org/r/226738 (https://phabricator.wikimedia.org/T84842) (owner: 10Ori.livneh) [16:48:42] <_joe_> ahem [16:48:42] oops [16:48:43] <_joe_> :) [16:48:54] <_joe_> ori: don't worry, nothing tragic [16:49:00] <_joe_> go on and merge it [16:49:05] are you sure? 
[16:49:09] <_joe_> we can amend it later if my doubt holds [16:49:09] I can revert and merge that + revert [16:49:27] ok, let's try it [16:49:27] <_joe_> it should work fine [16:49:36] sorry about that [16:50:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 140 Threads: 1 Questions: 15455 Slow queries: 0 Opens: 105 Flush tables: 2 Open tables: 64 Queries per second avg: 110.392 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [16:50:21] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479695 (10Krenair) Yes, stat1002 is stat1002.eqiad.wmnet, which is what you should have access to. Please paste the result of ssh -v like you did before, using your new config. [16:50:24] <_joe_> grrrit-wm left in contempt [16:51:38] !log Upgraded logstash1006 to elasticsearch 1.7.0 [16:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:12] (03PS2) 10Hashar: contint: install pip 7.1.0 from pypi [puppet] - 10https://gerrit.wikimedia.org/r/226729 [16:52:55] robh: all done with the logstash cluster upgrade!!! You can reenable the icniga check when you get a chance [16:53:06] that was fast =] [16:53:16] (03CR) 10Hashar: "Lame fix is to put a symlink for /usr/bin/pip https://gerrit.wikimedia.org/r/#/c/226729/1..2/modules/contint/manifests/packages/python.pp," [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [16:53:48] I'm going over to #elasticsearch to sing the praises of the 1.6.0 /_flush/synced feature [16:53:52] 6operations, 7Database: New s3 production cluster for mysql - https://phabricator.wikimedia.org/T106847#1479728 (10jcrespo) 3NEW [16:54:02] it's the greatest thing evar! 
[16:54:03] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479738 (10TJones) ``` OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011 debug1: Reading configuration data /Users/tjones/.ssh/config debug1: /Users/tjones/.ssh/config line 5: Applying... [16:54:27] (03PS1) 10Yuvipanda: labstore: PEP8 fixes and other minor corrections [puppet] - 10https://gerrit.wikimedia.org/r/226739 [16:54:58] so it was in maint mode [16:55:03] i told it to reenable checks on them. [16:55:09] that shoudl do it, we'll see... [16:55:30] (03PS2) 10Yuvipanda: labstore: PEP8 fixes and other minor corrections [puppet] - 10https://gerrit.wikimedia.org/r/226739 [16:55:41] sorry, said to renable notifications [16:55:49] The upgrade took ~30 minutes this time vs 9+ hours last time [16:55:53] that is awesome [16:56:51] bd808: whoa, really? [16:56:57] done? [16:57:04] yes :) [16:57:14] bd808: enjoy your Friday then! :) [16:57:18] hrmm [16:57:25] its not reenabled out of the window [16:57:34] i see how to schedule a window, but not how to preemptively end a window. [16:57:44] schedule downtime that is. [16:57:59] oh, found it, have to drill in to each one individually... [16:58:14] 6operations, 7Database: New s3 production cluster for mysql - https://phabricator.wikimedia.org/T106847#1479764 (10jcrespo) The latest bump is probably caused by T104278. [16:58:20] silly icinga [16:59:26] yea and then the interface icons dont update right away even though the status details do [16:59:27] heh [17:01:01] but its back to normal now (monitoring) [17:01:18] removed all scheduled downtimes. [17:02:06] (03PS3) 10Yuvipanda: labstore: PEP8 fixes and other minor corrections [puppet] - 10https://gerrit.wikimedia.org/r/226739 [17:02:14] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: PEP8 fixes and other minor corrections [puppet] - 10https://gerrit.wikimedia.org/r/226739 (owner: 10Yuvipanda) [17:11:17] have sweet friday folks ! 
I am off [17:13:08] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479826 (10Ottomata) @tjones was not added to the `bastiononly` group, which he will need in order to get to pretty much any production node. Doing so now... [17:14:38] (03PS1) 10coren: Minor python3 fixes to storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226742 [17:14:40] (03PS1) 10Ottomata: Add tjones to bastiononly group so he can ssh into stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/226743 (https://phabricator.wikimedia.org/T106175) [17:14:59] YuviPanda: Backups running with ^^ (minor python3 fix) [17:15:03] ottomata: doh, sorry about that (basiononly) [17:15:06] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1479848 (10RobH) 5Resolved>3Open [17:15:11] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1479850 (10GWicke) 5Resolved>3Open [17:15:29] Coren: it moved the entire file to the toollabs folder? [17:15:33] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1146174 (10GWicke) [17:15:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479858 (10TJones) Yay! I'm not (entirely) crazy! Thanks, @Ottomata! [17:15:44] YuviPanda: Oh, no-- that's just me being a moron. [17:15:59] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1479864 (10RobH) This has been reopened, the wikitech page has details for the current specifications (we had a meeting with services earlier today). the result is a new round of quote g... 
[17:16:03] (03Abandoned) 10coren: Minor python3 fixes to storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226742 (owner: 10coren) [17:16:05] (03CR) 10Ottomata: [C: 032] Add tjones to bastiononly group so he can ssh into stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/226743 (https://phabricator.wikimedia.org/T106175) (owner: 10Ottomata) [17:17:41] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1479872 (10RobH) a:5Eevans>3RobH [17:18:04] 6operations, 10Wikimedia-Logstash, 15User-Bd808-Test: Update Elasticsearch on logstash* to elasticsearch-1.7.0.deb - https://phabricator.wikimedia.org/T106126#1479873 (10bd808) 5Open>3Resolved Done. The addition of `/_flush/synced` in Elasticsearch 1.6.0 made this process much nicer than previous upgrade... [17:19:17] (03PS1) 10coren: Minor python3 fix to storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226744 [17:19:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479890 (10Ottomata) Try now... [17:19:37] YuviPanda: ^^ moron-free version. :-) [17:19:52] (03PS2) 10Yuvipanda: Minor python3 fix to storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226744 (owner: 10coren) [17:20:26] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1479893 (10TJones) I'm in! Thanks very much! [17:21:00] (03PS3) 10Yuvipanda: labstore: Minor python3 fix to storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226744 (owner: 10coren) [17:21:08] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Minor python3 fix to storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226744 (owner: 10coren) [17:21:17] Coren: merged [17:22:10] YuviPanda: I saw. It's running by hand fine right now on all three filesystems so we have a happy fun recent backup. [17:22:45] Coren: cool! 
I have a manually startable systemd unit on the way [17:23:10] YuviPanda: Well, we need three right? Or are you doing one unit for all three fs? [17:23:28] Coren: ah, no, I did mean 3 - they're just copypasta [17:23:32] (For now) [17:23:52] (03CR) 10AndyRussG: "This is sent in via the ResourceLoaderGetConfigVars hook... Just to check: since it arrives on the client with the RL startup module, cach" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [17:26:44] Coren: btw, we need to archive the older projects that no longer have NFS. [17:27:37] YuviPanda: Hmm. Make a tarball, keep in fs root? Easy enough and easy to restore from at need. [17:27:54] Coren: yup, and will prevent rsync from moving it around [17:28:13] (03PS5) 10Awight: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) [17:28:55] PROBLEM - Apache HTTP on mw1090 is CRITICAL: Connection refused [17:31:13] (03CR) 10Awight: "@AndyRussG: yep, I can confirm the global is sent in via the RL startup module." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [17:32:39] (03CR) 10Ori.livneh: [C: 031] Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [17:35:05] (03PS1) 10Yuvipanda: labstore: Add systemd units for replicating NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/226748 [17:35:10] (03CR) 10jenkins-bot: [V: 04-1] labstore: Add systemd units for replicating NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/226748 (owner: 10Yuvipanda) [17:35:10] Coren: ^ [17:35:16] ottomata, why does that group need to be added to bastiononly whereas deployment/restricted/parsoid-admin/ocg-render-admins users do not? [17:35:18] (03PS2) 10Yuvipanda: labstore: Add systemd units for replicating NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/226748 [17:37:29] greg-g: I'm about to deploy https://gerrit.wikimedia.org/r/#/c/226642/, with your agreement. Slight risk of breaking all wikimedia pageviews [17:38:27] awight: hmm, no brandon around, who's your ops person to help re varnish or whatever if things go sour? [17:39:46] Jeff_Green: hey, what're you doing for the next hour? [17:40:04] working on redis monitoring? [17:40:05] what's up? [17:40:15] I'm hoping to futz with the banner impression logging, unfortunately [17:40:20] https://phabricator.wikimedia.org/T106624 [17:40:43] I'll point CentralNotice to hit a /beacon* endpoint, which is a much cheaper varnish 204, rather than serving "" from PHP [17:40:52] * awight dodges rotten fruits [17:41:08] awight: why today? [17:41:18] greg-g: It can wait [17:41:33] just checking if it was "because this is messing shit up for us" [17:41:34] also what's left in that ticket to be done? 
[17:42:00] greg-g: I think its just taking a bit of load off of the app server cluster [17:42:08] YuviPanda: Kinda sad we have no clear provision for parametrizing source and dest directories so that we could use a single template. [17:42:43] Coren: hmm, we actually could, if we need to. let me actually do that. [17:42:57] Jeff_Green: updated the ticket to make the status clear [17:43:04] awight: I trust you, but, I think it's not worth a Friday deploy today [17:43:05] PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 50.00% of data above the critical threshold [35.0] [17:43:24] greg-g: wisdom I can live by, no problem. [17:43:40] awight: now, with your new found free time, go take some meeting minutes [17:43:52] lol harder than it sounds [17:44:03] :) :) [17:44:12] step 1) get invited to meetings [17:45:23] hahaha [17:48:19] greg-g: pleeeaaaase [17:48:26] this has been an issue for literally years [17:48:35] i'll own it if there's any fallout (but there won't be) [17:48:43] please please please [17:49:15] PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 57.14% of data above the critical threshold [35.0] [17:49:16] please please please please please please please please please [17:49:19] lol [17:49:26] !!!!!!!!!!!!!!!!!!!!!!!!! [17:49:33] the answer to this question will determine if you do it or not [17:49:40] did you type each please or copy/paste? [17:49:47] typed! [17:49:51] :) [17:50:04] let's see, ma.rk is gone, I think? 
[17:50:18] (03PS3) 10Yuvipanda: labstore: Add systemd units for replicating NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/226748 [17:50:23] (03CR) 10jenkins-bot: [V: 04-1] labstore: Add systemd units for replicating NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/226748 (owner: 10Yuvipanda) [17:50:25] Coren: ^ [17:50:35] (03PS4) 10Yuvipanda: labstore: Add systemd units for replicating NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/226748 [17:50:35] ori: if you own it, do it [17:50:45] awight: Jeff_Green ^ [17:51:51] my vote is next week [17:52:16] baahaha. There is amazing potential for fallout. Jeff_Green: Monday? [17:52:18] (03CR) 10coren: [C: 031] "Me gusta." [puppet] - 10https://gerrit.wikimedia.org/r/226748 (owner: 10Yuvipanda) [17:52:31] awight: monday should work ya [17:52:37] (03CR) 10Yuvipanda: [C: 032] labstore: Add systemd units for replicating NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/226748 (owner: 10Yuvipanda) [17:52:38] ok, calmer voices have squashed the eager beaver [17:52:39] wait [17:52:49] no, come on [17:52:51] Jeff_Green: Things I'll need to ask u for are fiddling with *ahem* DjangoBannerStats settings [17:53:01] and the log replication [17:53:13] Jeff_Green: I will 100% vouch for this [17:53:24] RECOVERY - Persistent high iowait on labstore2001 is OK Less than 50.00% above the threshold [25.0] [17:53:25] this will keep me insane all weekend [17:53:33] awight ya. i vaguely remember how all that works but haven't touched it in a year, so it's a significant context switch to figure it all out [17:53:43] ori: hehe, if Jeff_Green wants to show you how to adjust FR cluster log replication... [17:54:03] pretty sure there's some perl in there :-) [17:54:04] awight: what do you mean? 
[17:54:08] (03PS1) 10Yuvipanda: labstore: Remove extra base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/226749 [17:54:13] (03CR) 10jenkins-bot: [V: 04-1] labstore: Remove extra base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/226749 (owner: 10Yuvipanda) [17:54:22] (03PS2) 10Yuvipanda: labstore: Remove extra base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/226749 [17:54:30] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Remove extra base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/226749 (owner: 10Yuvipanda) [17:54:46] ori: there's a pipeline to get stuff from wherever udp2log is collecting it to an nfs mount in fundraising where the banner consumers can consume it [17:54:52] ori: just that, I really appreciate your offer of support, and tangible support received already, but there is Jeff_Green stuff you would have to learn instantaneously, to help us do this migration [17:55:06] Jeff_Green: already done [17:55:12] Jeff_Green: https://gerrit.wikimedia.org/r/#/c/226654/ [17:55:14] and verified [17:55:33] by curling the endpoint from the shell and verifying that 1:100 make it to the log file [17:55:33] looking [17:55:33] ori: this is a second step, to copy from there to barium in frack [17:56:06] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.092 second response time [17:56:21] where's the code for that? 
[17:56:52] good question [17:57:39] ori: There's also some business about puppetized /etc config files on barium [17:57:41] files/misc/scripts/rotate_fundraising_logs [17:57:45] RECOVERY - HHVM rendering on mw1090 is OK: HTTP OK: HTTP/1.1 200 OK - 72057 bytes in 0.136 second response time [17:57:47] We might need to jiggle those [17:58:03] https://wikitech.wikimedia.org/wiki/Fundraising_Analytics/Impression_Stats [17:58:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 50.00% of data above the critical threshold [500.0] [17:59:52] (03CR) 10Ori.livneh: [C: 04-1] "MobileContext::singleton()->getMobileUrl( $wgCentralBannerRecorder ) converts relative URLs to the empty string, apparently." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [18:00:18] ori: wow, thx. I only tested on desktop [18:01:03] i don't see anything in fundraising puppet having to do with the log file names [18:01:44] * greg-g notes down reminder: "To get ori to do something, tell him he has to wait" [18:02:35] (03PS1) 10Yuvipanda: base: Properly support not specifying service {} in service_unit [puppet] - 10https://gerrit.wikimedia.org/r/226750 [18:02:40] (03CR) 10jenkins-bot: [V: 04-1] base: Properly support not specifying service {} in service_unit [puppet] - 10https://gerrit.wikimedia.org/r/226750 (owner: 10Yuvipanda) [18:02:43] baahahahaha [18:02:45] (03PS2) 10Yuvipanda: base: Properly support not specifying service {} in service_unit [puppet] - 10https://gerrit.wikimedia.org/r/226750 [18:02:51] (03CR) 10Yuvipanda: [C: 032 V: 032] base: Properly support not specifying service {} in service_unit [puppet] - 10https://gerrit.wikimedia.org/r/226750 (owner: 10Yuvipanda) [18:04:49] awight: https://gerrit.wikimedia.org/r/226751 [18:05:51] (03PS1) 10Yuvipanda: labstore: Fix boolean conditional to do actual needed check [puppet] - 10https://gerrit.wikimedia.org/r/226752 [18:05:57] (03CR) 10jenkins-bot: [V: 04-1] labstore: 
Fix boolean conditional to do actual needed check [puppet] - 10https://gerrit.wikimedia.org/r/226752 (owner: 10Yuvipanda) [18:06:00] (03PS2) 10Yuvipanda: labstore: Fix boolean conditional to do actual needed check [puppet] - 10https://gerrit.wikimedia.org/r/226752 [18:06:06] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Fix boolean conditional to do actual needed check [puppet] - 10https://gerrit.wikimedia.org/r/226752 (owner: 10Yuvipanda) [18:08:23] (03PS1) 10Yuvipanda: labstore: Make storage-replicate executable [puppet] - 10https://gerrit.wikimedia.org/r/226754 [18:08:25] Coren: ^ heh :) [18:08:30] (03CR) 10jenkins-bot: [V: 04-1] labstore: Make storage-replicate executable [puppet] - 10https://gerrit.wikimedia.org/r/226754 (owner: 10Yuvipanda) [18:08:33] (03PS2) 10Yuvipanda: labstore: Make storage-replicate executable [puppet] - 10https://gerrit.wikimedia.org/r/226754 [18:08:39] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Make storage-replicate executable [puppet] - 10https://gerrit.wikimedia.org/r/226754 (owner: 10Yuvipanda) [18:08:41] (03PS1) 10Ori.livneh: Follow-up for Ia9a53a6a49: rotate beaconImpression logs [puppet] - 10https://gerrit.wikimedia.org/r/226755 [18:09:19] Jeff_Green: ^ [18:11:29] looks fine to me, I'll merge it once jenkins returns from death [18:12:25] (03PS1) 10Yuvipanda: labstore: Increase resolution of snapshot naming to include HHMMSS [puppet] - 10https://gerrit.wikimedia.org/r/226757 [18:12:27] Coren: ^ [18:12:30] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1480261 (10coren) 3NEW [18:12:30] (03CR) 10jenkins-bot: [V: 04-1] labstore: Increase resolution of snapshot naming to include HHMMSS [puppet] - 10https://gerrit.wikimedia.org/r/226757 (owner: 10Yuvipanda) [18:12:35] (03PS2) 10Yuvipanda: labstore: Increase resolution of snapshot naming to include HHMMSS [puppet] - 10https://gerrit.wikimedia.org/r/226757 [18:13:10] 6operations: paramiko 
(python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1480274 (10yuvipanda) Since we have to specify the name of the host explicitly in some way, we can also probably specify the key explicitly? via hiera. [18:13:12] ori: i'm unable to merge for one of several possible gerrity reasons [18:13:16] (03CR) 10coren: [C: 031] "Moar resomolutions." [puppet] - 10https://gerrit.wikimedia.org/r/226757 (owner: 10Yuvipanda) [18:13:18] (03PS2) 10Ori.livneh: Follow-up for Ia9a53a6a49: rotate beaconImpression logs [puppet] - 10https://gerrit.wikimedia.org/r/226755 [18:13:21] Jeff_Green: it needed rebase [18:13:25] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Increase resolution of snapshot naming to include HHMMSS [puppet] - 10https://gerrit.wikimedia.org/r/226757 (owner: 10Yuvipanda) [18:13:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Follow-up for Ia9a53a6a49: rotate beaconImpression logs [puppet] - 10https://gerrit.wikimedia.org/r/226755 (owner: 10Ori.livneh) [18:13:46] YuviPanda: you merged mine too right? [18:14:02] oh no [18:14:03] (03PS3) 10Ori.livneh: Follow-up for Ia9a53a6a49: rotate beaconImpression logs [puppet] - 10https://gerrit.wikimedia.org/r/226755 [18:14:04] hmm, no it didn't show up [18:14:06] you made it require another rebase [18:14:06] heh [18:14:35] (03CR) 10Ori.livneh: [V: 032] Follow-up for Ia9a53a6a49: rotate beaconImpression logs [puppet] - 10https://gerrit.wikimedia.org/r/226755 (owner: 10Ori.livneh) [18:14:43] there we go [18:14:46] Coren: am going to delete others20150724 [18:14:56] Coren: objections? [18:15:18] YuviPanda: It's not necessary, but if you're about to make another snapshot then it's clearly not needed. [18:15:27] Coren: yeah, other snapshot already made [18:15:27] YuviPanda: Destroy away! 
[18:15:32] !log running others20150724 on labstore1002 [18:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:50] err [18:15:56] Coren: lvremove is not the right thing to do? [18:16:10] or is it lvremove labstore/others20150724 [18:16:13] YuviPanda: Sure it is, if you intend to remove the lv. :-) [18:16:22] Yes, vg/lv [18:16:28] Jeff_Green: {{done}}. I don't think I have access to the frack cluster, so I can't poke around barium to see if there are non-puppetized things in /etc that would need to be updated. [18:16:36] Coren: so it's lvremove labstore/others20150724 [18:16:37] doing [18:16:49] * YuviPanda is slightly less but still fairly nooby on lvm [18:17:03] ori: as long as the logs are making it to nfs we can fix that parter later [18:17:10] er s/parter/part/ [18:17:13] !log removed labstore/others20150724 on labstore1002 [18:17:19] YuviPanda: lvm is trivial once you get the pv/vg/lv relationship. [18:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:55] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:19:04] (03PS6) 10Awight: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) [18:19:18] Coren: btw, we can probably move the ionice to systemd as well [18:19:29] ori: ^ using the absolute URL there [18:20:09] YuviPanda: It's not clear that we'd want the non-rsync bits to be ioniced - though I suppose it wouldn't harm. What benefit do you see to having the whole thing niced? [18:20:43] (03CR) 10Ori.livneh: [C: 031] Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [18:21:36] awight: tested that URL with getMobileUrl() too, looks good [18:21:49] Coren: mostly simplicity. 
IMO it shouldn't be the script's responsibility to do figure out IO priority [18:21:50] enwiki's is correctly set to string(38) "//en.m.wikipedia.org/beacon/impression" [18:22:21] thrilling. [18:22:28] so, what do you say? [18:22:53] ori: I'm waiting for the log rotation patches to settle, if that looks good I'll try to throw the switch today. [18:23:14] awight: also, why don't we do testwiki first? [18:23:23] I'm... not sure I see why moving the ionice from one place to another simplifies things; and hiding that away in the systemd config seems less clear to me. [18:23:41] ori: sure, but it's not going to test much. The important stuff doesn't have staging: log rotation and processing [18:24:08] I'm pretty confident that the CentralNotice part of the change is safe [18:24:09] awight: yeah, but my as well anyway. [18:24:12] sure! [18:24:18] i'll submit a patch [18:24:33] I only know how to set labs config, where do you configure testwiki? [18:25:10] Coren: hmm, I guess I don't see it as 'hiding', but that's ok, it's fairly immaterial now [18:25:25] Coren: is there any way we can get status updates from rsync into the log? [18:26:02] YuviPanda: rsync sucks at that. Either you have --progress on which spams you ridiculously but uselessly (because it's per-file) or you don't and it remains completely quiet. [18:26:34] Coren: so how about we have it spit out list of files and just print every 1000th file or something? 
[18:26:39] (to the log) [18:26:45] that'll give us what we want - an idea of where things are [18:26:53] (03PS7) 10Ori.livneh: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [18:26:55] (03PS1) 10Ori.livneh: testwiki: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226760 [18:27:00] awight: ^ [18:27:04] (rebased yours on top) [18:27:37] rsync copies in arbirary (directory) order IIRC, and won't output progress information at all for "just checking" which is 90% of what it does. [18:28:04] Coren: hmm, when we were doing manual rsyncs how were we checking progress? [18:28:29] (03CR) 10Awight: [C: 031] testwiki: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226760 (owner: 10Ori.livneh) [18:28:32] I was checking progress by looking at what directories were missing at the target. [18:28:45] (03CR) 10Ori.livneh: [C: 032] testwiki: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226760 (owner: 10Ori.livneh) [18:28:59] Coren: ah, I see. [18:29:25] Rsync progress output is really unusable for anything beyond "yep, it's doing something" and even then it sucks at that when the filesystems are mostly the same. :-) [18:29:33] heh [18:29:40] fair enough. 
[18:29:52] !log ori Synchronized wmf-config/CommonSettings.php: Idfe1fa60: testwiki: Point to a no-op /beacon URL rather than Special:RecordImpression (duration: 00m 12s) [18:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:18] !log Depooled Precise image scalers (mw1159 and mw1160) [18:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:57] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1480322 (10yuvipanda) [18:31:06] Coren: I'm filling in blocked-by for https://phabricator.wikimedia.org/T106474 [18:31:17] Coren: we will have to create new ones for the 'cleanup old LVs' [18:31:24] YuviPanda: only old paramiko's, right? [18:31:39] valhallasw`cloud: not sure, this one is the one shipping with jessie [18:31:43] (haven't investigated fully yet) [18:31:49] so probably outdated as heck :P [18:31:52] :P [18:31:58] iirc someone had issues with mysql workbench which uses paramiko [18:32:04] upgrading paramiko solved the issue [18:32:09] (which was the new ciphers before the revert) [18:32:15] 1.15.1 [18:32:55] ori: we're currently sampling RI calls by default, so you'll want to also set $wgCentralNoticeSampleRate = 1 for testwiki [18:33:14] valhallasw`cloud: latest is 1.15.2 and that has no new additions [18:33:23] valhallasw`cloud: that was cipher support, this is key exchange support. [18:33:29] awight: naw, no need. that calls to the beacon URL end up in the log file we already verified [18:33:34] valhallasw`cloud: newest paramiko supports our default ciphers just not our kex mechanisms [18:33:56] Ohh [18:34:00] ori: this is a sampling rate on whether or not we'll make the S:RI callback in the first place, from the client. 
[18:34:05] I see, ok [18:34:29] ori: so we need to stop sampling in order to test whether the config will work [18:35:00] awight: I just checked the value of mw.config.get( 'wgCentralBannerRecorder' ) [18:35:12] ok, that's fine [18:35:50] awight: how do the log rotation patches look? puppet should have ran everywhere by now. [18:36:04] ori: Good news: barium:/archive/udplogs/2015/beaconImpressions-sampled100.tsv-20150724-183001.log.gz [18:36:08] \o/ [18:36:30] shall I pull the trigger, then? [18:37:02] please please please please please please please please please please please [18:37:21] ro-ro-ro-rotate your logs [18:37:33] ori: hehe. Sure, please do. just leave a !log so I have the timestamp for the deployment [18:37:50] (03PS1) 10Yuvipanda: labstore: Set logging level to INFO fo storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226764 [18:37:52] I'm still fiddling with the log parsing code, but that can happen offline [18:37:55] (03CR) 10jenkins-bot: [V: 04-1] labstore: Set logging level to INFO fo storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226764 (owner: 10Yuvipanda) [18:38:02] (03PS2) 10Yuvipanda: labstore: Set logging level to INFO fo storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226764 [18:38:11] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Set logging level to INFO fo storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226764 (owner: 10Yuvipanda) [18:38:12] !log Merging Ib7c7861e: Point to a no-op /beacon URL rather than Special:RecordImpression [18:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:28] thx! 
[18:38:36] (03CR) 10Ori.livneh: [C: 032] Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [18:38:42] (03Merged) 10jenkins-bot: Point to a no-op /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) (owner: 10Awight) [18:39:14] !log ori Synchronized wmf-config/CommonSettings.php: Ib7c7861e: Point to a no-op /beacon URL rather than Special:RecordImpression (duration: 00m 12s) [18:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:22] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1480360 (10coren) Hm. Specifying the key would work, I suppose, since we need a hack to get paramiko to work at all anyways... I wonder when upstream is going to upgrade the... [18:39:28] awight: up to 5 minutes for startup module change to propagate [18:39:54] yep [18:42:19] (03PS1) 10Yuvipanda: labstore: Clean up logging format for storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226765 [18:42:24] (03CR) 10jenkins-bot: [V: 04-1] labstore: Clean up logging format for storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226765 (owner: 10Yuvipanda) [18:42:29] (03PS2) 10Yuvipanda: labstore: Clean up logging format for storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226765 [18:42:35] awight: http://i.kinja-img.com/gawker-media/image/upload/s--q1JQzk0v--/qtr9wvgsujidg5zri4im.jpg [18:42:40] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Clean up logging format for storage-replicate [puppet] - 10https://gerrit.wikimedia.org/r/226765 (owner: 10Yuvipanda) [18:43:28] I see the new URL... [18:43:52] Weirdly, I can log into everything but erbium, so I can't check that the right file is filling up. 
[18:44:01] awight: I've been watching the rate at which {beacon,banner}Impressions-sampled100.tsv.log is growing / shrinking [18:44:06] yeah [18:44:10] the right file is filling up [18:46:11] awight: yep, bannerImpressions-sampled100.tsv.log stopped growing, beaconImpressions-sampled100.tsv.log growing at the rate the bannerImpressions was growing pre-deploy [18:46:29] wicked. thanks for pushing us towards the light! [18:48:25] * ori jumps up and down [18:49:46] http://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [18:49:52] shoudl I see anything there? [18:50:08] I mean, in response to the /beacon change [18:51:22] nah it's not API [18:51:25] it's the regular application servers [18:51:27] it's a special page! [18:51:41] but empty responses are not a lot of bytes! [18:53:25] It's possible we were already under the radar for app server load, cos we're only making this silly request every 1/100 pageviews. [18:53:41] "only" for some definition of a billion people / 100 [18:54:04] where are campaigns running atm? [18:55:05] Fundraising is in JP, where it's currently nite-nite. There are community campaigns in he-IL, francophone countries, IT, and AR. [18:55:44] * ori watches https://performance.wikimedia.org/ [18:56:25] Looks like we were getting roughly 1M pageviews/hr to the Special page [18:57:13] headline: "Wikimedia page views drop by 730m for month of August" [18:57:42] omg. All too true [18:57:55] i *think* analytics exempts those already [18:57:57] let's hope [18:58:26] on a side note, we're finally poised to use HHVM for fundraising wikis... looking forward to that [18:59:00] awight: Awesome. MW is edging closer to pulling the plug on 5.3! [18:59:21] awight, Jeff_Green, greg-g -- thanks a bunch!! [18:59:22] Of all things, we were blocked due to using a single php-core api which was too stupid to port to hhvm :) -- number of days in the month [18:59:33] James_F: One can hope! 
[18:59:56] ori: thanks for patiently explaining all the things we should do ;) [19:00:00] ori: enjoy your weekend :) [19:00:15] awight: Ha. [19:01:14] greg-g: YOU CAN'T MAKE HIM [19:03:02] YuviPanda: right [19:03:10] ori: dont enjoy your weekend yet [19:04:25] .. [19:05:42] oh, you're just making a joke [19:05:54] i thought you were about to point out some grave issue [19:05:59] oh, my bad [19:06:04] 6operations, 6Engineering-Community: date/budget proposal for 2015 Ops Offsite - https://phabricator.wikimedia.org/T89023#1480452 (10Rfarrand) Tentative dates have be proposed by @mark This looks likely to take place in Sep or Oct 2015 - specifics of location and time have not been finalized. [19:06:09] joke back-fire [19:06:38] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1480470 (10brion) The state of video scaling in Trusty seems to be completely fubar if my testing... [19:07:16] greg-g: I think ori's just trolling you [19:07:30] nah [19:08:03] !log remove others20150724183453 on labstore1002 [19:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:07] since video transcodes still work, I assume video scalers are not running Trusty yet [19:10:29] brion: yeah. I was just reading T103335. :( [19:10:40] whee :) [19:10:40] thanks for testing that and reporting your findings [19:10:45] i love it when things break upstream \o/ [19:11:02] i seem to recall some discussion in debian about whether to bring back ffmpeg [19:11:04] did i hallucinate that? [19:11:16] ori: so it sounds like ffmpeg is back, but post-jessie [19:11:25] ffmpeg is back yes. 
[19:11:32] so if we run servers on jessie, and want ffmpeg, we have to use a backport [19:11:55] heh, we're swarming with debian developers recently :) [19:13:18] if avconv and ffmpeg2theora work on jessie and are not-broken that may be fine for us [19:13:24] i'll have to check what it supports by default [19:13:27] but the trusty versions are fucked :( [19:14:07] * brion fires up more VMs [19:14:25] brion: i wouldn't bother (yet); we haven't really tested the app server stack on jessie at all [19:14:36] ok [19:14:39] there are probably many issues which would encumber a migration to jessie for any app server role [19:14:43] fixing trusty seems more likely [19:15:00] HHVM we don't have jessie builds at all [19:15:11] fun! no rush then :D [19:15:13] (03PS1) 10Yuvipanda: labstore: Fix timestamp for Lockdir [puppet] - 10https://gerrit.wikimedia.org/r/226890 [19:15:18] (03CR) 10jenkins-bot: [V: 04-1] labstore: Fix timestamp for Lockdir [puppet] - 10https://gerrit.wikimedia.org/r/226890 (owner: 10Yuvipanda) [19:15:22] (03PS2) 10Yuvipanda: labstore: Fix timestamp for Lockdir [puppet] - 10https://gerrit.wikimedia.org/r/226890 [19:15:28] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Fix timestamp for Lockdir [puppet] - 10https://gerrit.wikimedia.org/r/226890 (owner: 10Yuvipanda) [19:15:39] brion: are all the issues that you linked to in your comment ones that are currently _not_ affecting precise? 
[19:15:40] legoktm: thx for picking up the abusefilter thing [19:16:00] the comments on https://phabricator.wikimedia.org/T55863 seem to imply it's currently broken [19:16:11] (03PS1) 10coren: Make sure storage-replicate only removes the lockdir if it held the lock [puppet] - 10https://gerrit.wikimedia.org/r/226892 [19:16:23] YuviPanda: ^^ said trivial fix [19:16:23] ori: the opus and vp9 codecs are missing in precise [19:16:35] the broken webm and ogv output bugs are not present in precise [19:16:51] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1480521 (10GWicke) @moritzmuehlenhoff, is there anything stopping us from importing those packages? [19:16:57] (03PS2) 10Yuvipanda: Make sure storage-replicate only removes the lockdir if it held the lock [puppet] - 10https://gerrit.wikimedia.org/r/226892 (owner: 10coren) [19:17:04] brion: right, so it's those two bugs that we should consider blockers, right? [19:17:06] (03CR) 10Yuvipanda: [C: 032 V: 032] Make sure storage-replicate only removes the lockdir if it held the lock [puppet] - 10https://gerrit.wikimedia.org/r/226892 (owner: 10coren) [19:17:36] Coren: I still don't like the 'handle errors with return code' pattern - we aren't handling them in a lot of places. if we switch to exceptions, we just handle them where we want and have them crash elsewhere, which is probably better than silently failing which is what we do now [19:17:36] ori: yeah sounds right [19:18:10] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1480526 (10brion) >>! In T103335#1480470, @brion wrote: > The state of video scaling in Trusty see... [19:19:16] YuviPanda: We want it to silently fail in some cases, though I suppose we could catch them. Hrm. [19:19:33] Coren: yes, those should be the.... (dadam!) 
exception, not the rule :D [19:19:43] har de har har :) [19:19:44] can someone teach me how to clear the ldap cache on polonium? [19:20:07] (the one that blocks email delivery until the negative cache expires and it re-syncs with lda=p [19:20:09] ) [19:20:23] YuviPanda: I just find badly logged stack traces to be so completely crap at being informative rather than a crafted error message. But meh. [19:20:35] Coren: 'badly logged'? [19:21:05] Coren: I have the opposite bias, since a stack trace accurately tells me where to look at, and if accompanied by stderr super useful. [19:21:07] YuviPanda: There is no way to specify "well, if there is an exception here, log this at X loglevel" [19:21:28] Coren: there is. you catch it and do logging.info / error / exception. there's even a logging.exception() that does the right thing by default. [19:21:49] Coren: and since this is systemd you get stderr output in logs, which will give you exception stacktrace with timestamps [19:21:56] YuviPanda: If you have to put a try/catch block around every place, how is that a win vs a simple if err: ? [19:22:12] Coren: you don't put it around every place, you only put it around the places you want to catch the exception and recover. [19:22:34] you aren't expecting mkdir to fail, so if it fails you want the process to crash [19:23:19] Well, that's a bad example because that's indeed the primary lock mechanism but I see your point. [19:24:20] Coren: so you can either use exceptions or check each err and do an exit. 
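A minimal sketch of the exception-based style YuviPanda is arguing for, combined with the "only remove the lockdir if we held the lock" fix from 19:16:11. It is not the storage-replicate script itself: `LockHeld`, `lockdir`, `run_once` and the logger name are hypothetical. The only library behaviour relied on is that `os.mkdir` raises `FileExistsError` when the directory already exists and that `Logger.exception` logs at ERROR level with the current traceback.

```python
import contextlib
import logging
import os

log = logging.getLogger("replicate-sketch")  # hypothetical logger name


class LockHeld(Exception):
    """Raised when another process already owns the lock directory."""


@contextlib.contextmanager
def lockdir(path):
    # mkdir either creates the directory or fails, so it doubles as the lock.
    try:
        os.mkdir(path)
    except FileExistsError as exc:
        # We never held the lock, so we must not remove the directory.
        raise LockHeld(path) from exc
    try:
        yield
    finally:
        # Reached only when our own mkdir succeeded.
        os.rmdir(path)


def run_once(lock_path, work):
    """Run work() under the lock; return False if the lock was busy."""
    try:
        with lockdir(lock_path):
            work()
    except LockHeld:
        # The one case we expect and recover from: log it and move on.
        log.info("lock %s is held elsewhere, skipping this run", lock_path)
        return False
    except Exception:
        # Anything else is unexpected: record the traceback, then crash.
        log.exception("replication run failed")
        raise
    return True
```

Only the expected failure is caught and turned into a quiet skip; every other error propagates with its stack trace, and under systemd that trace lands in the journal with timestamps, as noted at 19:21:49.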
[19:24:24] and the former is much better [19:39:29] Coren: can you also respond on https://phabricator.wikimedia.org/T106590 [19:39:50] * Coren looks [19:40:24] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 3.29520543234e-71 [19:41:42] e-71 that is not a lot [19:41:45] ottomata: ^^^^ [19:41:58] seems there is some funny roudinig issue [19:44:15] Coren: I've also pinged you on https://phabricator.wikimedia.org/T98183 and a couple of other tickets (including the toolserver ones). do respond when you can :) [19:44:35] Yeah, I need to finish catching up email and phab and other paperwork. [19:44:59] Ima set up a couple hours over the weekend to do that in peace now that I no longer spend over half my days sitting in the bathroom. :-) [19:50:31] Coren: heh, ok [20:05:07] (03PS1) 10Faidon Liambotis: Revert "rubocop: Fixed Style/TrailingWhitespace offense" [puppet] - 10https://gerrit.wikimedia.org/r/226898 [20:05:32] (03PS2) 10Faidon Liambotis: Revert "rubocop: Fixed Style/TrailingWhitespace offense" [puppet] - 10https://gerrit.wikimedia.org/r/226898 [20:07:33] (03CR) 10Faidon Liambotis: [C: 032] Revert "rubocop: Fixed Style/TrailingWhitespace offense" [puppet] - 10https://gerrit.wikimedia.org/r/226898 (owner: 10Faidon Liambotis) [20:10:12] (03PS1) 10GWicke: Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries [puppet] - 10https://gerrit.wikimedia.org/r/226901 (https://phabricator.wikimedia.org/T73853) [20:13:23] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, and 3 others: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1480795 (10awight) [20:19:21] fyi, just gave James_F permission to deploy a fix to an UBN! 
breakage to VE's template editing [20:19:27] (03PS1) 10RobH: adding in pagerduty paging to icinga [puppet] - 10https://gerrit.wikimedia.org/r/226903 [20:21:00] (03PS2) 10RobH: adding in pagerduty paging to icinga [puppet] - 10https://gerrit.wikimedia.org/r/226903 [20:21:11] (03CR) 10RobH: [C: 032] adding in pagerduty paging to icinga [puppet] - 10https://gerrit.wikimedia.org/r/226903 (owner: 10RobH) [20:23:03] robh: and now the wait for something critical to break ;) [20:23:30] (03CR) 10Aaron Schulz: Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226901 (https://phabricator.wikimedia.org/T73853) (owner: 10GWicke) [20:23:32] well we get a false ipv6 alarm every weekend, bblack always beats me to irc and finds its not a real alarm ;D [20:23:42] so figured i'll get it merged before the weekend, since its just appending a single contact. [20:24:34] * robh is still babysitting the puppet run on neon. [20:24:39] smart decision making [20:24:44] (03CR) 10GWicke: Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226901 (https://phabricator.wikimedia.org/T73853) (owner: 10GWicke) [20:25:31] AaronSchulz: so with 1 it will never be retried? 
[20:26:31] reloaded without issue ;] [20:27:25] yes [20:27:44] (03PS2) 10GWicke: Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries [puppet] - 10https://gerrit.wikimedia.org/r/226901 (https://phabricator.wikimedia.org/T73853) [20:27:47] it's always worded as "attempts" not "retries" [20:27:56] just popping a job counts as attempt #1 [20:28:03] AaronSchulz: I think the 'recycled' threw me off [20:28:47] how about "How many times to let jobs be executed before abandoning" [20:30:32] (03PS3) 10GWicke: Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries [puppet] - 10https://gerrit.wikimedia.org/r/226901 (https://phabricator.wikimedia.org/T73853) [20:32:27] (03PS1) 10Hashar: debian: fix email in changelog [software/conftool] - 10https://gerrit.wikimedia.org/r/226909 [20:33:38] gwicke: yeah that comment was misleading [20:34:53] AaronSchulz: thanks for noticing the setting issue in the first place ;) [20:36:37] (03PS1) 10Hashar: debian: fix lintian error about bad dist name [software/conftool] - 10https://gerrit.wikimedia.org/r/226910 [20:36:37] robh: grats, a small change didn't page every and inherently accidentally test the feature you wanted to test but didn't because it failed :D [20:36:43] (03CR) 10Aaron Schulz: [C: 031] Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries [puppet] - 10https://gerrit.wikimedia.org/r/226901 (https://phabricator.wikimedia.org/T73853) (owner: 10GWicke) [20:36:49] !log krenair Synchronized php-1.26wmf15/extensions/VisualEditor/modules/ve-mw/ui: https://gerrit.wikimedia.org/r/#/c/226907/ (duration: 00m 12s) [20:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:37:19] Krenair: Works now. Whee! [20:38:04] though it could have paged and it owuld have been ok [20:38:47] really? [20:39:19] i think ariel is the only ops 24x7 auto page right now. 
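The "attempts, not retries" point from 20:27:47 can be made concrete with a small sketch. This is not the MediaWiki job queue implementation; `run_with_attempt_limit` and its return shape are invented for illustration. The first execution counts as attempt #1, so a limit of 1 means the job is never retried.

```python
def run_with_attempt_limit(job, max_attempts):
    """Execute job() up to max_attempts times in total.

    Returns (succeeded, attempts_used). The first run already counts
    as an attempt, so max_attempts=1 allows no retry at all.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1  # popping the job off the queue is attempt #1
        try:
            job()
        except Exception:
            continue  # failed; loop again only if attempts remain
        return True, attempts
    return False, attempts  # limit reached, job abandoned
```

Read this way, a setting of 3 means "run at most three times", which matches the rewording AaronSchulz proposes at 20:28:47.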
[20:39:27] we have fundraising folks in 24x7 too [20:39:38] oh, brandon is 24x7 =P [20:40:14] Daniel removed his though idk if he would be 24x7 either :p [20:40:58] but i don't know how my change could have paged anyone, since they aren't in pager duty ;D [20:41:12] perhaps [20:51:46] 6operations, 10Analytics-Cluster: Build new latest stable (0.8.2.1?) Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1480936 (10Ottomata) OO! I was able to build this! Amazing! Alex's Makefiles are super simple! I had some merge conflicts and auto-merges I wasn't able to... [20:58:16] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1480958 (10RobH) this quote is being tracked on https://rt.wikimedia.org/Ticket/Display.html?id=9507 [21:27:31] (03PS1) 10Ori.livneh: wmflib: validate_array_re( array $items, string $re ) [puppet] - 10https://gerrit.wikimedia.org/r/226921 [21:28:30] (03CR) 10Ori.livneh: [C: 032 V: 032] wmflib: validate_array_re( array $items, string $re ) [puppet] - 10https://gerrit.wikimedia.org/r/226921 (owner: 10Ori.livneh) [21:29:14] (03CR) 10Ori.livneh: "@Tim: I added https://gerrit.wikimedia.org/r/#/c/226921/ for this. I hope it's useful." [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:35:00] 6operations, 10CirrusSearch, 6Discovery: Release wikimedia-extra plugin for Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106161#1481071 (10Nemo_bis) Thanks. Where is the repository? http://mvnrepository.com/artifact/org.wikimedia.search/extra/ links a non-existing repo on github, https://github.c...
[21:37:20] 6operations, 10CirrusSearch, 6Discovery: Release wikimedia-extra plugin for Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106161#1481072 (10Nemo_bis) Nevermind, found: https://git.wikimedia.org/project/search [21:57:49] !log running mwscript populateContentModel.php --wiki=enwiki --ns=all --table=page [21:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:27] (03PS10) 10Rush: Add Phragile module. [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [22:14:07] (03CR) 10Rush: [C: 032 V: 032] "this is still doing some labs only things but is bound for labs only so that seems reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [22:25:06] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1481198 (10MoritzMuehlenhoff) >>! In T103335#1480470, @brion wrote: > The state of video scaling i... [22:39:10] robh: could you send me an invite to the new pagerduty setup? [22:39:22] yea you are on there [22:39:36] when you go to login it'll be one of the options [22:39:48] gotcha thanks [22:40:11] with our non-rotation setup not a big fan of the on/off emails [22:40:13] I think [22:40:32] yea me either [22:40:39] it's annoying since we're intentionally rotating daily [22:41:18] when i set up folks post testing, i'll uncheck that one for them when i invite ;D [22:41:40] chasemp: actually, i'm logged in, want me to change yours?
[22:41:43] cmjohnson1: ^ [22:41:50] sure [22:42:30] (03PS1) 10Yuvipanda: labstore: Make script use exceptions instead of return value checking [puppet] - 10https://gerrit.wikimedia.org/r/226937 [22:42:31] i disabled on his as well, its pointless since its daily [22:42:35] (03CR) 10jenkins-bot: [V: 04-1] labstore: Make script use exceptions instead of return value checking [puppet] - 10https://gerrit.wikimedia.org/r/226937 (owner: 10Yuvipanda) [22:42:49] chasemp: also i added in the pagerduty email sms event to our icinga [22:42:59] k [22:43:01] so if anything pages over the weekend, you, myself, and chris will get both the normal sms [22:43:03] and pd [22:43:07] sounds good [22:46:35] (03PS2) 10Yuvipanda: labstore: Make script use exceptions instead of return value checking [puppet] - 10https://gerrit.wikimedia.org/r/226937 [23:14:20] (03PS1) 10Alex Monk: Also look at wikipedia.dblist [software] - 10https://gerrit.wikimedia.org/r/226939 (https://phabricator.wikimedia.org/T106897) [23:15:10] (03PS2) 10Alex Monk: maintain-replicas: Also look at wikipedia.dblist [software] - 10https://gerrit.wikimedia.org/r/226939 (https://phabricator.wikimedia.org/T106897) [23:17:13] PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 62.50% of data above the critical threshold [35.0] [23:17:55] (03Abandoned) 10Chmarkine: Change protocol relative to https [puppet] - 10https://gerrit.wikimedia.org/r/226731 (owner: 10Chmarkine) [23:18:07] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1481447 (10RobH) [23:18:33] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1481449 (10dr0ptp4kt) I pinged on wikitech-l one last time. 
[23:20:22] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1481458 (10RobH) I've added a 24x7 pagerduty contact to the sms group, so it'll be notified of all the opsen pages. Chase, Chris, & myself are setup in pagerduty to get alerts. So we'll be getting... [23:27:33] RECOVERY - Persistent high iowait on labstore2001 is OK Less than 50.00% above the threshold [25.0] [23:29:00] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1481526 (10srijan) Hi! I am not able to login to stat1003. Here is what I am getting: $ ssh srijan@stat1003.eqiad.wmnet Permission denied (publickey). ssh_exchange_... [23:33:58] (03PS1) 10Alex Monk: Also add srijan to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/226941 (https://phabricator.wikimedia.org/T106407) [23:34:52] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1481540 (10Chmarkine) Have these communities been notified yet? [23:58:04] PROBLEM - puppet last run on mw2091 is CRITICAL puppet fail