[00:00:04] (03CR) 10Chad: [C: 032] clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032 (owner: 10Chad) [00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T0000). Please do the needful. [00:00:05] ebernhardson and tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:28] \o [00:00:30] (03PS1) 10Ladsgroup: Enable ORES review tool in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339090 (https://phabricator.wikimedia.org/T151611) [00:00:45] jouncebot: next [00:00:45] In 13 hour(s) and 59 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T1400) [00:01:19] (03PS3) 10Dzahn: sshd-phab service config needs to be a template [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4) [00:01:31] I can SWAT [00:01:31] tgr: damn that's a lot of config patches :P [00:01:31] tgr: Will leave comments on change [00:01:54] (03Merged) 10jenkins-bot: clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032 (owner: 10Chad) [00:02:08] Hey, I added one item to SWAT [00:02:13] sorry for being late [00:02:21] thcipriani: mine are both the same, javascript only changes which remove some test code [00:02:30] AB test code [00:02:50] okie doke [00:03:29] are we over the SWAT limit? 
I can remove some of those config changes [00:03:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339041 (https://phabricator.wikimedia.org/T158698) (owner: 10Gergő Tisza) [00:03:53] OTOH the first three are trivial beta-only changes that can be synced without looking [00:03:53] I *think* we'll be ok [00:04:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339012 (owner: 10Gergő Tisza) [00:04:29] (03CR) 10jenkins-bot: clean.py: Rework command execution, reduce code dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339032 (owner: 10Chad) [00:04:41] Whoops, forgot it was swat time already [00:04:47] thcipriani: My sync will be done before your first merge lands [00:04:58] RainbowSprinkles: no problem, go for it [00:05:05] !log demon@tin Synchronized scap/plugins/clean.py: More code cleanup (duration: 00m 40s) [00:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:03] (03Merged) 10jenkins-bot: Fix PageViewInfo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339041 (https://phabricator.wikimedia.org/T158698) (owner: 10Gergő Tisza) [00:06:35] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3045107 (10Dzahn) I have never used a draft and would not miss them. [00:07:01] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:07:21] (03CR) 10Chad: clean.py: Fix up l10nupdate-owned files on masters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad) [00:08:42] (03CR) 10jenkins-bot: Fix PageViewInfo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339041 (https://phabricator.wikimedia.org/T158698) (owner: 10Gergő Tisza) [00:08:56] (03PS2) 10Thcipriani: Fix Sentry URL scheme on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339012 (owner: 10Gergő Tisza) [00:09:15] (03CR) 10Thcipriani: Fix Sentry URL scheme on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339012 (owner: 10Gergő Tisza) [00:09:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339012 (owner: 10Gergő Tisza) [00:09:43] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3045110 (10Paladox) >>! In T158656#3045107, @Dzahn wrote: > I have never used a draft and would not miss them. There being replaced with private edits :) [00:10:10] (03PS1) 10BryanDavis: Use POST when fetching parsed HTML for wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339092 (https://phabricator.wikimedia.org/T158715) [00:11:10] (03Merged) 10jenkins-bot: Fix Sentry URL scheme on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339012 (owner: 10Gergő Tisza) [00:11:19] (03CR) 10jenkins-bot: Fix Sentry URL scheme on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339012 (owner: 10Gergő Tisza) [00:11:38] ebernhardson: your changes for both .12 and .13 are live on mwdebug1002 if there's anything to check there. 
[00:11:59] (03PS4) 10Thcipriani: Fix SiteConfiguration array merge syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336747 (https://phabricator.wikimedia.org/T157656) (owner: 10Gergő Tisza) [00:12:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336747 (https://phabricator.wikimedia.org/T157656) (owner: 10Gergő Tisza) [00:13:05] !log smalyshev@tin Started deploy [wdqs/wdqs@7768422]: Deploy 2.1.5RC WAR on 2001 for testing [00:13:07] thcipriani: code seems to load and not complain, tis about all i can check [00:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:31] !log smalyshev@tin Finished deploy [wdqs/wdqs@7768422]: Deploy 2.1.5RC WAR on 2001 for testing (duration: 00m 25s) [00:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:47] ebernhardson: wfm, ok, going live wmf.13 first [00:13:52] (03PS5) 10Chad: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) [00:14:27] (03Merged) 10jenkins-bot: Fix SiteConfiguration array merge syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336747 (https://phabricator.wikimedia.org/T157656) (owner: 10Gergő Tisza) [00:14:36] (03CR) 10jenkins-bot: Fix SiteConfiguration array merge syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336747 (https://phabricator.wikimedia.org/T157656) (owner: 10Gergő Tisza) [00:15:01] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:15:08] (03CR) 10BryanDavis: [C: 032] Use POST when fetching parsed HTML for wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339092 (https://phabricator.wikimedia.org/T158715) (owner: 10BryanDavis) [00:16:00] !log thcipriani@tin Synchronized php-1.29.0-wmf.13/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:339011|Turn off sister search AB test.]] T157942 (duration: 00m 43s) [00:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:05] T157942: turn off sistersearch A/B/C test on 2017-02-21 - https://phabricator.wikimedia.org/T157942 [00:16:28] (03PS6) 10Chad: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) [00:17:07] !log thcipriani@tin Synchronized php-1.29.0-wmf.12/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:339010|Turn off sister search AB test.]] T157942 (duration: 00m 39s) [00:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:37] (03CR) 10Chad: clean.py: Fix up l10nupdate-owned files on masters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad) [00:18:10] (03Merged) 10jenkins-bot: Use POST when fetching parsed HTML for wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339092 (https://phabricator.wikimedia.org/T158715) (owner: 10BryanDavis) [00:18:21] PROBLEM - Check systemd state on elastic1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:18:26] thcipriani: thanks! 
i'll keep an eye on the events coming in to make sure nothing crazy happens [00:18:31] PROBLEM - Elasticsearch HTTPS on elastic1043 is CRITICAL: SSL CRITICAL - failed to verify search.svc.eqiad.wmnet against elastic1043.eqiad.wmnet [00:20:40] thcipriani: the other config patches need testing on mwdebug; the two exception-related ones can go together [00:20:46] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:336747|Fix SiteConfiguration array merge syntax]] T157656 [[gerrit:339012|Fix Sentry URL scheme on beta]] [[gerrit:339041|Fix PageViewInfo config]] T158698 (beta-only changes) (duration: 00m 39s) [00:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:52] T158698: PageViewInfo does not work on beta - https://phabricator.wikimedia.org/T158698 [00:20:52] T157656: "Undefined index: wmgExtraLanguageNames" on beta - https://phabricator.wikimedia.org/T157656 [00:21:02] ^ 1043 seems to have gotten a bad certificate somehow :S will make a ticket for gehel shouldn't be a problem but needs to be depooled (if it is pooled) [00:21:20] tgr: yup, just looking at them now [00:22:05] (03PS10) 10Thcipriani: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: 10Anomie) [00:22:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: 10Anomie) [00:24:25] (03Merged) 10jenkins-bot: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: 10Anomie) [00:24:35] (03CR) 10jenkins-bot: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: 10Anomie) [00:25:32] (03PS3) 10BryanDavis: Use IB3 library [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 [00:26:26] 
(03PS2) 10Chad: clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 [00:26:28] tgr: Set $wgSoftBlockRanges is on mwdebug1002, check please [00:28:22] (03PS4) 10Thcipriani: Send 'exception' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [00:29:06] hm, either that's not working or I have misconceptions on how it can be tested [00:29:37] tgr: whoops, forgot to rebase, sorry :( [00:30:10] tgr: ok, try now [00:30:38] (03PS1) 10Chad: clean.py: Minor abstraction of param handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339094 [00:30:57] works, thx [00:31:35] ok, will go live with initialisesettings first and then all of wmf-config in 2 syncs [00:33:12] (03CR) 10BryanDavis: [C: 032] "Tested via cherry-pick." [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 (owner: 10BryanDavis) [00:33:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:324215|Set $wgSoftBlockRanges]] T154698 PART I (duration: 00m 40s) [00:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:47] T154698: Prevent contributions attributed to private and WMF IP addresses - https://phabricator.wikimedia.org/T154698 [00:33:59] (03Merged) 10jenkins-bot: Use IB3 library [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339066 (owner: 10BryanDavis) [00:34:53] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:324215|Set $wgSoftBlockRanges]] T154698 PART II (duration: 00m 42s) [00:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:58] ^ tgr live now [00:35:03] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [00:35:05] (03PS2) 10Chad: Scap clean: Automate purging of old deployment branches from gerrit [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/336901 [00:35:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [00:35:29] (03PS3) 10BryanDavis: Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 [00:35:50] (03PS2) 10BryanDavis: Run flake8 on both python2 and python3 [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339076 [00:36:28] (03Merged) 10jenkins-bot: Send 'exception' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [00:36:44] (03PS3) 10Thcipriani: Do not send 'exception-json' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [00:36:51] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [00:37:37] (03CR) 10jenkins-bot: Send 'exception' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [00:37:52] tgr: send exception channel to logstash is on mwdebug1002 if you want to check it there. 
[00:38:18] (03Merged) 10jenkins-bot: Do not send 'exception-json' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [00:38:19] thx, I will [00:38:47] although probably better to check them together [00:39:16] tgr: ok, they are both there now [00:39:43] (03CR) 10jenkins-bot: Do not send 'exception-json' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [00:40:32] (03CR) 10Dzahn: [C: 032] sshd-phab service config needs to be a template [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4) [00:44:01] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:44:56] thanks thcipriani, it seems to work [00:45:17] tgr: great! will sync [00:45:33] (03CR) 10Dzahn: "on iridium:" [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4) [00:46:07] (03CR) 10Dzahn: "should have fixed it on jessie systems. yea. 9 ExecStart=/usr/sbin/sshd -D -f/etc/ssh/sshd_config.phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/338302 (owner: 1020after4) [00:48:08] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:323111|Send "exception" channel to logstash]] [[gerrit:323330|Do not send "exception-json" channel to logstash]] T136849 (duration: 00m 40s) [00:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:14] T136849: normalized_message is a JSON dump of the whole event for exceptions in beta logstash - https://phabricator.wikimedia.org/T136849 [00:48:14] ^ tgr live everywhere [00:48:21] thanks! 
[00:48:26] yw :) [00:48:44] (03CR) 10BryanDavis: [C: 032] Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 (owner: 10BryanDavis) [00:49:11] PROBLEM - Disk space on elastic1035 is CRITICAL: DISK CRITICAL - free space: / 5673 MB (12% inode=98%) [00:49:22] (03Merged) 10jenkins-bot: Fix flake8 E128 warnings [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339067 (owner: 10BryanDavis) [00:49:24] (03Merged) 10jenkins-bot: Run flake8 on both python2 and python3 [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/339076 (owner: 10BryanDavis) [00:49:32] (03PS2) 10Thcipriani: Enable ORES review tool in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339090 (https://phabricator.wikimedia.org/T151611) (owner: 10Ladsgroup) [00:49:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339090 (https://phabricator.wikimedia.org/T151611) (owner: 10Ladsgroup) [00:49:42] (03PS3) 10Dzahn: contint: drop npm settings for precise [puppet] - 10https://gerrit.wikimedia.org/r/337203 (https://phabricator.wikimedia.org/T158652) [00:50:21] Amir1: trying to remember the order of operations for setting up ores on new wikis: create tables then run maintenance scripts then sync? [00:50:30] (03PS2) 10Dzahn: labs_vagrant: drop precise support [puppet] - 10https://gerrit.wikimedia.org/r/337205 (https://phabricator.wikimedia.org/T143349) [00:50:58] thcipriani: sync, create tables, run maintenance [00:51:12] RECOVERY - Disk space on elastic1035 is OK: DISK OK [00:51:14] (03PS2) 10Dzahn: toollabs: drop precise-related monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/337207 (https://phabricator.wikimedia.org/T143349) [00:51:29] Amir1: ah, ok, so it doesn't freak out if it doesn't find the tables initially? 
[00:51:44] (03Merged) 10jenkins-bot: Enable ORES review tool in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339090 (https://phabricator.wikimedia.org/T151611) (owner: 10Ladsgroup) [00:51:47] hmm [00:51:51] you're right [00:51:52] (03CR) 10jenkins-bot: Enable ORES review tool in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339090 (https://phabricator.wikimedia.org/T151611) (owner: 10Ladsgroup) [00:52:15] create tables, run maintenance and then sync [00:52:25] (03CR) 10Dzahn: [C: 031] Introduce linters using rake [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/338387 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [00:52:41] oh oh [00:52:53] (I was thinking without syncing, can not access to the maintenance scripts /sql files) [00:52:59] jouncebot's grid job is going nuts :/ [00:53:36] (03CR) 10Dzahn: [C: 031] Introduce linters using rake [puppet/kafka] - 10https://gerrit.wikimedia.org/r/338385 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [00:53:47] the nickser recover stuff seems to work though :) [00:55:36] Amir1: ok, tables created [00:56:04] Amir1: should I fetch over to mwdebug for you to check or maintenance first? [00:56:33] without syncing we can't test it [00:57:07] :) [00:57:41] Amir1: ok, on mwdebug1002, check please [00:58:09] Thanks! [01:00:51] thcipriani: good! [01:01:19] Amir1: cool, so I'll sync initialisesettings and run the maintenance scripts, sound right? 
[01:01:44] awesome [01:01:46] thanks [01:03:34] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:339090|Enable ORES review tool in cswiki]] T151611 (duration: 00m 39s) [01:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:39] T151611: Enable ORES Review Tool on Czech Wikipedia - https://phabricator.wikimedia.org/T151611 [01:05:12] Amir1: ran checkmodelversions, running populatedatabase now [01:06:42] thcipriani: Awesome, Tested and works great [01:07:03] Amir1: cool, glad to hear it, thanks for testing :) [01:10:40] (03PS10) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) [01:11:30] (03CR) 10Paladox: [C: 031] contint: drop npm settings for precise [puppet] - 10https://gerrit.wikimedia.org/r/337203 (https://phabricator.wikimedia.org/T158652) (owner: 10Dzahn) [01:12:11] (03CR) 10Krinkle: contint: drop npm settings for precise (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337203 (https://phabricator.wikimedia.org/T158652) (owner: 10Dzahn) [01:13:23] (03PS5) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [01:13:35] Amir1: populatedatabase completed! [01:13:49] (03PS6) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [01:13:51] (03CR) 10jerkins-bot: [V: 04-1] nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [01:14:36] thcipriani: Awesome [01:14:46] we are done now. 
I'll make the announcement [01:15:11] (03PS11) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) [01:19:11] (03PS6) 10Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892 [01:19:13] (03PS4) 10Tim Landscheidt: Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030 [01:21:33] (03PS7) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [01:23:31] (03PS8) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [01:24:32] (03CR) 10Tim Landscheidt: "@chasemp: The on-demand query takes about 1.5 s (time /usr/local/sbin/tools-clush-generator map --observer-pass $OBSERVER_PASS --project t" [puppet] - 10https://gerrit.wikimedia.org/r/328030 (owner: 10Tim Landscheidt) [01:24:58] (03CR) 10Paladox: "I managed to get ssh working on a labs instance with phabricator installed :)." [puppet] - 10https://gerrit.wikimedia.org/r/338294 (owner: 10Paladox) [01:27:59] (03CR) 10Chad: "I guess I should also grant force push rights on extensions on wmf/* branches to deployers, otherwise nobody will be able to do this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/336901 (owner: 10Chad) [01:32:19] (03CR) 10Chad: "Oh, we already have it, nvm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336901 (owner: 10Chad) [01:48:11] PROBLEM - Disk space on elastic1035 is CRITICAL: DISK CRITICAL - free space: / 4362 MB (9% inode=98%) [01:51:51] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 264 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:54:11] RECOVERY - Disk space on elastic1035 is OK: DISK OK [01:56:51] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 415 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [01:56:51] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 264 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:00:21] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:00:46] (03CR) 10Krinkle: [C: 031] Scap clean: Automate purging of old deployment branches from gerrit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336901 (owner: 10Chad) [02:01:26] Krinkle: Thx for the review. 
There's a whole chain of stuff to land here :) [02:01:31] Should be much nicer soon [02:01:51] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 415 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [02:01:59] https://gerrit.wikimedia.org/r/#/q/topic:cleaner-scap+status:open is the full topic [02:04:11] PROBLEM - Disk space on elastic1035 is CRITICAL: DISK CRITICAL - free space: / 596 MB (1% inode=98%) [02:07:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 3694 MB (8% inode=98%) [02:09:07] gehel: Got 2 more of those ^ [02:10:31] PROBLEM - Disk space on elastic1044 is CRITICAL: DISK CRITICAL - free space: / 4631 MB (10% inode=98%) [02:11:23] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [02:12:09] elastic{1043,1039,1035} [02:12:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 4207 MB (9% inode=98%) [02:12:31] RECOVERY - Disk space on elastic1044 is OK: DISK OK [02:12:41] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:14:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 4753 MB (10% inode=98%) [02:28:11] RECOVERY - Disk space on elastic1035 is OK: DISK OK [02:29:21] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:29:21] RECOVERY - Disk space on elastic1039 is OK: DISK OK [02:32:42] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.12) (duration: 11m 48s) [02:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:09] (03CR) 10Subramanya Sastry: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [02:37:01] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:42] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:42] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:51] PROBLEM - DPKG on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:51] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:21] PROBLEM - SSH on bast3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:41] PROBLEM - configured eth on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:41] PROBLEM - dhclient process on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:42] PROBLEM - salt-minion processes on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:39:31] RECOVERY - dhclient process on bast3001 is OK: PROCS OK: 0 processes with command name dhclient [02:39:31] RECOVERY - configured eth on bast3001 is OK: OK - interfaces up [02:39:32] RECOVERY - salt-minion processes on bast3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:39:41] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full [02:39:41] RECOVERY - DPKG on bast3001 is OK: All packages OK [02:39:42] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational [02:40:01] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set [02:40:21] RECOVERY - SSH on bast3001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [02:40:41] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [02:40:51] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures [02:44:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 4391 MB (9% inode=98%) [02:46:21] RECOVERY - Disk space on elastic1039 is OK: DISK OK [02:46:42] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:51] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:47:31] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational [02:47:41] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [02:54:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 4626 MB (10% inode=98%) [02:55:21] RECOVERY - Disk space on elastic1039 is OK: DISK OK [02:58:21] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [03:00:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 42 MB (0% inode=98%) [03:00:31] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:04:27] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.13) (duration: 13m 54s) [03:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Feb 22 03:10:13 UTC 2017 (duration 5m 46s) [03:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:21] RECOVERY - Disk space on elastic1039 is OK: DISK OK [03:22:31] PROBLEM - Disk space on elastic1044 is CRITICAL: DISK CRITICAL - free space: / 5701 MB (12% inode=98%) [03:24:31] RECOVERY - Disk space on elastic1044 is OK: DISK OK [03:28:31] PROBLEM - Disk space on elastic1044 is CRITICAL: DISK CRITICAL - free space: / 4106 MB (9% inode=98%) [03:29:31] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [03:33:11] PROBLEM - Disk space on elastic1035 is CRITICAL: DISK CRITICAL - free space: / 4150 MB (9% inode=98%) [03:33:41] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:35:12] RECOVERY - Disk space on elastic1035 is OK: DISK OK [03:35:31] RECOVERY - Disk space on elastic1044 is OK: DISK OK [03:49:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 4916 MB (11% inode=98%) [04:01:41] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [04:02:14] (03PS3) 10Andrew Bogott: Tools: Enable PHP module mcrypt on Trusty execution nodes [puppet] - 10https://gerrit.wikimedia.org/r/324957 (https://phabricator.wikimedia.org/T97857) (owner: 10Tim Landscheidt) [04:04:41] (03CR) 10Andrew Bogott: [C: 032] Tools: Enable PHP module mcrypt on Trusty execution nodes [puppet] - 10https://gerrit.wikimedia.org/r/324957 (https://phabricator.wikimedia.org/T97857) (owner: 10Tim Landscheidt) [04:14:51] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [04:21:21] RECOVERY - Disk space on elastic1039 is OK: DISK OK [04:35:31] PROBLEM - Disk space on elastic1044 is CRITICAL: DISK CRITICAL - free space: / 5612 MB (12% inode=98%) [04:45:31] RECOVERY - Disk space on elastic1044 is OK: DISK OK [05:02:11] (03CR) 1020after4: "So you want to abort instead of just continuing and skipping the problematic key? Does it really add any safety?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660) (owner: 10Volans) [05:02:18] (03CR) 1020after4: [C: 031] Keyholder: fix filter of passwordless keys [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660) (owner: 10Volans) [05:03:25] (03CR) 1020after4: [C: 031] Keyholder: add support for ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/339002 (https://phabricator.wikimedia.org/T158659) (owner: 10Volans) [05:04:01] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1873.40 Read Requests/Sec=4170.70 Write Requests/Sec=8.20 KBytes Read/Sec=18360.40 KBytes_Written/Sec=82.40 [05:04:17] (03CR) 1020after4: [C: 031] Revert "Add replication client grants to phuser" [puppet] - 10https://gerrit.wikimedia.org/r/335554 (owner: 10Jcrespo) [05:17:01] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=164.80 Read Requests/Sec=203.90 Write Requests/Sec=64.80 KBytes Read/Sec=3082.40 KBytes_Written/Sec=377.60 [06:04:03] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team, 06Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3045538 (10Niedzielski) Hmm... I've been trying to debug this further but aft... 
[06:15:31] PROBLEM - Disk space on elastic1044 is CRITICAL: DISK CRITICAL - free space: / 5689 MB (12% inode=98%) [06:19:31] RECOVERY - Disk space on elastic1044 is OK: DISK OK [06:46:20] (03PS2) 10Subramanya Sastry: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 [06:47:11] PROBLEM - Disk space on elastic1035 is CRITICAL: DISK CRITICAL - free space: / 5456 MB (12% inode=98%) [06:49:21] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 4963 MB (11% inode=98%) [06:49:31] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:49:31] PROBLEM - Disk space on elastic1044 is CRITICAL: DISK CRITICAL - free space: / 5311 MB (11% inode=98%) [07:04:01] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.257 second response time [07:07:11] RECOVERY - Disk space on elastic1035 is OK: DISK OK [07:09:01] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.140 second response time [07:13:06] (03PS1) 10Marostegui: db-codfw.php: Repool db2055 restore db2048 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339118 (https://phabricator.wikimedia.org/T132416) [07:16:03] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2055 restore db2048 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339118 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:17:28] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2055 restore db2048 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339118 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:17:31] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet 
is currently enabled, last run 3 seconds ago with 0 failures [07:17:36] (03CR) 10jenkins-bot: db-codfw.php: Repool db2055 restore db2048 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339118 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:18:36] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2055 - T132416 (duration: 00m 40s) [07:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:42] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [07:19:02] (03PS1) 10Marostegui: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339119 (https://phabricator.wikimedia.org/T132416) [07:21:23] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339119 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:22:32] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339119 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:22:40] (03CR) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339119 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:23:17] !log Deploy alter table enwiki.revision db2062 - T132416 [07:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:31] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2062 - T132416 (duration: 00m 40s) [07:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:01] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 1.361 second response time [07:25:11] PROBLEM - Disk space 
on elastic1035 is CRITICAL: DISK CRITICAL - free space: / 5126 MB (11% inode=98%) [07:29:01] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.128 second response time [07:30:11] PROBLEM - Disk space on elastic1035 is CRITICAL: DISK CRITICAL - free space: / 5314 MB (11% inode=98%) [07:43:18] !log trncating logs on elastic10(35|39|44) [07:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:57] !log restart elasticsearch on elastic1035 [07:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:11] RECOVERY - Disk space on elastic1035 is OK: DISK OK [07:57:09] (03PS2) 10Filippo Giunchedi: uwsgi: parametrize service settings [puppet] - 10https://gerrit.wikimedia.org/r/338804 [07:57:11] (03PS2) 10Filippo Giunchedi: coal: disable uwsgi autoload [puppet] - 10https://gerrit.wikimedia.org/r/338805 [07:57:13] (03PS1) 10Filippo Giunchedi: graphite: split checks for carbon-relay frontend/local drops [puppet] - 10https://gerrit.wikimedia.org/r/339122 [07:59:23] (03PS15) 10Giuseppe Lavagetto: Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) [08:02:28] (03CR) 10Giuseppe Lavagetto: Add schema support (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [08:06:49] (03CR) 10Filippo Giunchedi: [C: 032] graphite: split checks for carbon-relay frontend/local drops [puppet] - 10https://gerrit.wikimedia.org/r/339122 (owner: 10Filippo Giunchedi) [08:07:35] !log upgrading openssl on redis clusters / various base service restarts [08:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:49] (03PS2) 10Filippo Giunchedi: graphite: split checks for carbon-relay frontend/local drops [puppet] - 10https://gerrit.wikimedia.org/r/339122 [08:12:51] 
(03PS3) 10Filippo Giunchedi: uwsgi: parametrize service settings [puppet] - 10https://gerrit.wikimedia.org/r/338804 [08:12:53] (03PS3) 10Filippo Giunchedi: coal: disable uwsgi autoload [puppet] - 10https://gerrit.wikimedia.org/r/338805 [08:17:07] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] graphite: split checks for carbon-relay frontend/local drops [puppet] - 10https://gerrit.wikimedia.org/r/339122 (owner: 10Filippo Giunchedi) [08:17:45] andrewbogott: I'm merging your change [08:18:05] and hope it doesn't break anything on tools [08:18:51] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [08:19:48] (03PS2) 10Filippo Giunchedi: Increase SWIFT_RETRIES in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/338957 (https://phabricator.wikimedia.org/T157949) (owner: 10Gilles) [08:22:01] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Increase SWIFT_RETRIES in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/338957 (https://phabricator.wikimedia.org/T157949) (owner: 10Gilles) [08:28:15] (03PS3) 10Giuseppe Lavagetto: prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 [08:28:33] (03PS1) 10Elukey: Add the Apache Prometheus exporter to bohrium [puppet] - 10https://gerrit.wikimedia.org/r/339123 (https://phabricator.wikimedia.org/T154558) [08:28:41] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#2771280 (10MoritzMuehlenhoff) That's still the case. Since piwik 2.16 is up and running on bohrium already, we could also simply use "dpkg-repack" to generate the deb from the installation o... 
[08:29:40] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 (owner: 10Giuseppe Lavagetto) [08:31:19] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#3045761 (10elukey) We have also opened https://phabricator.wikimedia.org/T158322 to upgrade piwik to a more recent version, should we set up another component like we did for thirdparty/cdh? [08:31:50] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5521/bohrium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/339123 (https://phabricator.wikimedia.org/T154558) (owner: 10Elukey) [08:33:38] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, minor nit on /metrics being the default" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336852 (owner: 10Giuseppe Lavagetto) [08:34:11] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openssl] [08:34:16] (03PS4) 10Giuseppe Lavagetto: prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 [08:35:13] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#3045766 (10MoritzMuehlenhoff) Let's wait with adding new components until we have a plan how to move forward with the repository. piwik is a single package, so there's no risk of it overshad... [08:36:11] <_joe_> godog: heh found an issue with my PS with the switch to TLS [08:36:16] <_joe_> godog: fixing :) [08:37:49] hehe ok [08:43:32] (03CR) 10Hashar: "From their README, that has been made on purpose I guess for the purpose of scripting?" [puppet] - 10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [08:44:53] jynus: good morning. 
I had my coffee and brain is warmed up :} If you are still up for some puppet.git merges I am ready whenever you are. [08:45:17] hashar, give me 10 minutes [08:45:34] jynus: sure thing take all the time you need [08:45:55] (03PS3) 10Ema: cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) [08:48:02] (03PS5) 10Giuseppe Lavagetto: prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 [08:48:09] <_joe_> godog: I am going to merge this change, not sure if it will break things [08:48:22] <_joe_> as we never really tested prometheus::class_config [08:49:06] I might be interested to use it as well for bohrium :D [08:52:41] _joe_: ok, if the catalog compiles then the prometheus config itself isn't going to cause trouble I think [08:53:17] (03CR) 10Volans: "@twentyafterfour" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660) (owner: 10Volans) [08:53:29] <_joe_> godog: well the function is going to be called from the prometheus hosts [08:53:38] <_joe_> which are uncompilable atm [08:55:08] yep that's what I meant by "if the catalog compiles" [09:00:13] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:02:03] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:02:08] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(35|39|43|44).eqiad.wmnet [09:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:55] (03PS6) 10Giuseppe Lavagetto: prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 [09:04:08] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] prometheus: add etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/336852 (owner: 10Giuseppe Lavagetto) [09:04:58] !log rebuilding translation memories index - ETA ~4hours (from terbium, logs in ~dcausse/ttm-refresh) [09:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:11] (03PS2) 10Gehel: elasticsearch - reimage elastic10(35|39|43|44) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/339017 (https://phabricator.wikimedia.org/T151326) [09:06:19] <_joe_> godog: heh, it fails ofc [09:06:29] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic10(35|39|43|44) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/339017 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [09:07:40] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3045854 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1043.eqiad.wmnet'] ``` The... 
[09:08:02] _joe_: of course [09:08:11] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3045863 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1035.eqiad.wmnet'] ``` The... [09:08:33] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:08:38] <_joe_> godog: with an obscure message, will need to do debugging [09:08:53] <_joe_> I'll take a coffee first [09:09:28] hashar, we can probably go now? [09:09:51] sure thing :} [09:10:14] https://gerrit.wikimedia.org/r/#/c/338143/ is to instruct puppet-syntax to stop trying to puppet parser validate a bunch of stdlib manifests that are puppet 4 only [09:10:39] so merely just an ignore rule, and that is solely when running puppet syntax check locally on all manifests [09:10:45] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#3045883 (10fgiunchedi) 05Open>03Resolved Switchback to graphite1001 has been completed, I've updated {T88997} for followup on what services didn't follow the CNAME change correc... [09:11:02] later on I will find out a way to syntax check both with puppet 3.7 and puppet 4.8 but time hasn't come for that yet :) [09:11:05] yeah I'll take a break too [09:11:51] the three other changes are related to Jenkins configuration file. And if we feel brave we can try switching to systemd :} [09:12:03] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:12:33] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:13:02] the prometheus puppet failures are expected btw [09:14:13] (03PS5) 10Jcrespo: syntax: ignore stdlib Puppet 4 manifests [puppet] - 10https://gerrit.wikimedia.org/r/338143 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [09:14:34] (03PS1) 10Gehel: elasticsearch - reimage elastic10(45|46|47) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/339130 (https://phabricator.wikimedia.org/T151326) [09:14:35] !log Run pt-table-checksum on s2.nlwiki over some tables - T154485 [09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:40] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485 [09:15:03] PROBLEM - DPKG on neodymium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:16:03] RECOVERY - DPKG on neodymium is OK: All packages OK [09:18:03] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:18:30] Fetching source index from https://rubygems.org/... [09:19:33] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:21:24] (03CR) 10Jcrespo: [C: 032] syntax: ignore stdlib Puppet 4 manifests [puppet] - 10https://gerrit.wikimedia.org/r/338143 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [09:25:04] (03PS7) 10Jcrespo: jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/337385 (owner: 10Hashar) [09:27:13] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:47] jynus: I will handle the puppet runs on the two hosts ( contint1001 and contint2001 ) [09:28:05] ok, should we disable them now? 
[09:28:13] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:28:30] !log disable puppet on contint1001. Will use contint2001 as a canary [09:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:35] jynus: done :} [09:28:48] I will merge them separatedly [09:28:52] so if only one fails [09:29:00] we can revert only that one [09:29:09] (03CR) 10Jcrespo: [C: 032] jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/337385 (owner: 10Hashar) [09:29:25] yup seems easier to have a lot of incremental patches [09:29:42] (03PS10) 10Jcrespo: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar) [09:29:58] I have to praise your work, lot of puppet normally means you are doing your part on the cleaning up [09:30:20] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3045971 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1035.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1035.eqi... [09:30:29] as first goal I want jenkins to be harnessed with systemd [09:30:33] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [09:30:43] then be able to have multiple jenkins instances in parallel on the same host [09:30:43] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:30:47] (03CR) 10DCausse: WIP - elasticsearch: only send minimal logging to console (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) (owner: 10Gehel) [09:31:03] and later starts puppetizing the crap of xml configuration files Jenkins has. 
But that part is a bit challenging [09:31:32] you can run puppet now [09:31:33] running puppet on contint2001 [09:31:52] will restart the service with the new conf and diff ps -u jenkins f [09:32:34] (03CR) 10Gehel: WIP - elasticsearch: only send minimal logging to console (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) (owner: 10Gehel) [09:32:53] pfff [09:33:11] the .erb is wrong :( [09:33:24] <%= [09:33:30] JENKINS_ACCESSLOG_ENABLE=xxxx [09:33:31] %> [09:33:36] that is actually evaluated as ruby [09:33:44] yes [09:33:45] and thus yields: xxxx [09:33:57] extra files [09:33:59] so the /etc/default/jenkins ends up with just: [09:34:00] --accessLoggerClassName=winstone.accesslog.SimpleAccessLogger --simpleAccessLogger.format=combined --simpleAccessLogger.file=/var/log/$NAME/access.log [09:34:14] you have to remove lines 79 [09:34:16] and 83 [09:34:23] PROBLEM - carbon-local-relay metric drops on graphite2001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [09:34:36] (03PS1) 10Jcrespo: Revert "jenkins: allow access log to be flipped" [puppet] - 10https://gerrit.wikimedia.org/r/339133 [09:34:45] let's revert that, try with the others [09:35:00] and probably use the puppet compiler [09:35:19] (03CR) 10DCausse: WIP - elasticsearch: only send minimal logging to console (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) (owner: 10Gehel) [09:35:34] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "jenkins: allow access log to be flipped" [puppet] - 10https://gerrit.wikimedia.org/r/339133 (owner: 10Jcrespo) [09:35:36] * hashar takes note he needs to add tests for the template expansions [09:35:52] eek [09:36:27] one of the change adds some basic rspec to at least assert the class compiles in a catalog [09:36:31] 06Operations, 10ops-eqiad, 06DC-Ops: 1624-Power Supply Unplugged - Power Supply 1 is unplugged - elastic1043 - 
https://phabricator.wikimedia.org/T158749#3046028 (10Gehel) [09:36:37] would have to build on top of that and add tests for the templates expansions [09:37:00] !log cache_text, cache_upload: libssl1.1 upgraded to 1.1.0e-1+wmf1, libevent-2.0-5 upgraded to 2.0.21-stable-2+deb8u1 [09:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:13] PROBLEM - puppet last run on prometheus1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:37:20] can you revase manually 337388 ? [09:37:24] *rebase [09:37:43] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:38:24] (03PS1) 10Hashar: jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/339134 [09:38:33] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [09:38:41] jynus: https://gerrit.wikimedia.org/r/339134 remove the <%= from the erb template [09:39:02] I am rebasing 337388 on top of it [09:39:23] RECOVERY - carbon-local-relay metric drops on graphite2001 is OK: OK: Less than 1.00% above the threshold [25.0] [09:39:31] ok [09:39:36] (03PS11) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 [09:39:39] want me to run the compiler ? 
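Editorial note on the ERB problem discussed above (09:33 to 09:38): inside `<%= ... %>` the text is evaluated as Ruby, so an assignment there emits only its value and the variable name never reaches the rendered file. A minimal sketch, assuming a shortened, hypothetical option string and a lowercase variable name (the real template is not shown in this log):

```ruby
require 'erb'

# Hypothetical, shortened stand-in for the template line discussed above.
# Inside <%= %> the assignment is Ruby code, so only its value is rendered.
broken = <<~TEMPLATE
  <%=
  accesslog_enable="--simpleAccessLogger.format=combined"
  %>
TEMPLATE

# With the ERB tags removed, the line is emitted literally.
fixed = <<~TEMPLATE
  accesslog_enable="--simpleAccessLogger.format=combined"
TEMPLATE

broken_out = ERB.new(broken).result(binding).strip
fixed_out  = ERB.new(fixed).result(binding).strip

puts "broken: #{broken_out}"
puts "fixed:  #{fixed_out}"
```

This mirrors the symptom reported for /etc/default/jenkins, where only the option string survived, and the fix of dropping the `<%=` wrapper.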
[09:40:03] (03CR) 10Jcrespo: [C: 032] jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/339134 (owner: 10Hashar) [09:40:30] it should be ok, it was an oversight on my side [09:40:33] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:40:53] will add some tests this afternoon :} [09:41:27] +JENKINS_ACCESSLOG_ENABLE="--accessLoggerClassName=winstone.accesslog.SimpleAccessLogger --simpleAccessLogger.format=combined --simpleAccessLogger.file=/var/log/$NAME/access.log" [09:41:29] \O/ [09:42:08] restarting jenkins on contint2001 and comparing [09:42:24] ok [09:43:14] all good [09:43:22] shall I continue? [09:43:25] yes! [09:43:39] https://gerrit.wikimedia.org/r/#/c/337388/ is a bit annoying [09:43:51] that is to let us change the TCP port a Jenkins web service listens to [09:44:12] and there is additionally an Apache proxy in front of the Jenkins web service which thus need the port to be adjusted as well [09:44:37] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/5523/console [09:44:46] \O/ [09:45:23] RECOVERY - Check systemd state on elastic1043 is OK: OK - running: The system is fully operational [09:45:25] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046053 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1043.eqiad.wmnet'] ``` and were **ALL** successful. [09:45:42] Unable to find facts for host contint2001.codfw.wmnet <-- did I spell it wrong? 
[09:45:54] eeek [09:45:57] contint2001.wikimedia.org [09:46:08] the hosts have public IP to be able to reach out labs [09:46:11] ahm thanks [09:46:40] it says ok for 1001 [09:46:43] so merging [09:46:52] neat [09:46:57] I love noop output [09:47:06] (03CR) 10Jcrespo: [C: 032] jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar) [09:48:13] RECOVERY - Elasticsearch HTTPS on elastic1043 is OK: SSL OK - Certificate elastic1043.eqiad.wmnet valid until 2022-02-21 09:46:30 +0000 (expires in 1824 days) [09:48:41] running puppet on contint2001 [09:48:52] noop [09:49:16] (03PS4) 10Hashar: jenkins: add basic specs [puppet] - 10https://gerrit.wikimedia.org/r/337836 [09:49:20] I think only 337836 is left [09:49:51] so that one adds rspec-puppet tests for the jenkins module [09:50:02] basically attempts to compile the catalog for the 3 classes we have [09:50:23] so if there is some basic fault somewhere, that will report it when running the tests locally [09:50:34] it is kind of the equivalent of the puppet compiler but without the node name nor the facts [09:50:53] to run it one need to install the gems: bundle install [09:51:02] then: cd modules/jenkins && bundle exec rake spec [09:51:26] that creates a fake hierarchy of puppet modules under spec/fixtures/ based on what is listed in the .fixtures.yaml (really it just symlink other modules) [09:51:42] then for each of the three class, try to compile the catalog, eventually passing some parameters [09:51:47] I get it [09:51:51] one sure thing, it is a noop to prod. 
[09:52:03] it is just that we discussed about testing puppet [09:52:10] my .plan is to polish up some doc and probably give a short presentation about it to the whole ops team [09:52:23] how worth it it is to test puppet's ruby doing what it is supposed to do [09:52:57] I can merge it, no problem (I am giving it a check on all code created) [09:52:58] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3046056 (10elukey) Today I tried to do the following: * Bump the maximum Apache connections allowed (hence the number of processes due... [09:53:03] my first real use cases for tests was when I refactored the zuul manifest to use hiera lookup instead of parameters in roles [09:53:06] that helped dramatically [09:53:21] but we should talk about testing more generally with other ops [09:53:27] (puppet code) [09:53:44] modules/jenkins/.fixtures.yaml list 9 other modules which I am not happy about. But that just show how heavily coupled our modules are [09:54:39] from discussions I had with Alexandros and some others, one culprit is when the tests are just a reimplementation of the puppet code. That is not so useful [09:54:53] but we will find out :} [09:55:33] what fails on those excluded modules? [09:55:42] and how these treats submodules? [09:56:13] !log restarting salt-master on neodymium after openssl upgrade [09:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:14] jynus: which excluded modules are you referring to ? 
[09:57:49] the ones on the yaml [09:57:52] oh [09:58:07] so puppet is configured to use modules/jenkins/spec/fixtures as a base path [09:58:27] the rspec-puppet creates an empty site.pp in there: modules/jenkins/spec/fixtures/manifests/site.pp [09:58:53] then process .fixtures.yml to create modules under modules/jenkins/spec/fixtures/modules/ (which puppet is configured to point at) [09:58:57] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046057 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1039.eqiad.wmnet'] ``` The... [09:58:58] (03CR) 1020after4: [C: 031] "With my confusion now eliminated, I say merge it. :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660) (owner: 10Volans) [09:59:08] ok, I didn't read the dependent code [09:59:13] so when one runs rake spec, that would symlink all the modules listed in the .fixtures.yml file [09:59:26] I got it wrong [09:59:30] you can try with: cd modules/jenkins && bundle exec rake spec_prep [09:59:38] that does the fixtures preparation, but does not run tests [09:59:51] then git status --ignore , would show bunch of symlinks have been created [10:00:09] (03CR) 10Jcrespo: [C: 032] jenkins: add basic specs [puppet] - 10https://gerrit.wikimedia.org/r/337836 (owner: 10Hashar) [10:00:22] jouncebot: next [10:00:22] In 3 hour(s) and 59 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T1400) [10:00:29] the Puppet configuration is handled in modules/jenkins/spec/spec_helper.rb which is required by each tests. It sets puppet module_path and manifest_dir to the dirs under fixtures [10:00:50] sigh my deployment got overwritten? 
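Editorial note on the fixture preparation described above: the steps (read the fixtures file, create an empty site.pp, symlink the listed modules under spec/fixtures/modules/) can be sketched with the Ruby standard library. This is an illustration only, not the code that `rake spec_prep` actually runs; the `prepare_fixtures` helper, the simplified fixtures layout, and the `base` module are all hypothetical.

```ruby
require 'fileutils'
require 'tmpdir'
require 'yaml'

# Illustrative re-creation of the fixture layout described in the chat.
# NOT the real spec_prep implementation; paths and names are made up.
def prepare_fixtures(module_dir)
  fixtures = YAML.safe_load(File.read(File.join(module_dir, '.fixtures.yml')))
  symlinks = fixtures.fetch('fixtures').fetch('symlinks')

  modules_dir   = File.join(module_dir, 'spec', 'fixtures', 'modules')
  manifests_dir = File.join(module_dir, 'spec', 'fixtures', 'manifests')
  FileUtils.mkdir_p([modules_dir, manifests_dir])

  # Empty site.pp, so catalogs compile without any node definitions.
  FileUtils.touch(File.join(manifests_dir, 'site.pp'))

  # One symlink per listed module, resolved relative to the module directory.
  symlinks.map do |name, source|
    link = File.join(modules_dir, name)
    FileUtils.ln_s(File.expand_path(source, module_dir), link) unless File.symlink?(link)
    link
  end
end

created = Dir.mktmpdir do |root|
  # Fake repository with two modules; 'jenkins' depends on 'base'.
  jenkins = File.join(root, 'modules', 'jenkins')
  FileUtils.mkdir_p([jenkins, File.join(root, 'modules', 'base')])
  File.write(File.join(jenkins, '.fixtures.yml'), <<~YAML)
    fixtures:
      symlinks:
        jenkins: "."
        base: "../base"
  YAML
  prepare_fixtures(jenkins).map { |link| [File.basename(link), File.directory?(link)] }
end

p created
```

Every module listed in the fixtures file becomes a dependency of the test run, which is why a long list there is a visible sign of coupling between modules.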
[10:00:56] it is all a bit messy :-/ Gotta present that in a nice doc [10:01:29] going ahead anyways [10:01:46] godog, what is it? [10:01:47] !log enabling puppet on contint1001 and running it [10:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:20] jynus: moving mw udp2log from fluorine to mwlog1001 [10:02:30] (03PS3) 10Filippo Giunchedi: Switch udp2log destination to mwlog1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337560 (https://phabricator.wikimedia.org/T123728) [10:02:35] yeah, better go ahead now [10:02:37] that ^ [10:02:43] thank later [10:02:54] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Switch udp2log destination to mwlog1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337560 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [10:03:03] hashar, test that last deployment [10:03:09] and we should stop here [10:03:13] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:03:14] doing so [10:03:15] -JENKINS_ACCESSLOG_ENABLE="--accessLoggerClassName=winstone.accesslog.SimpleAccessLogger --simpleAccessLogger.format=combined --simpleAccessLogger.file=/var/log/jenkins/access.log" [10:03:15] +JENKINS_ACCESSLOG_ENABLE="--accessLoggerClassName=winstone.accesslog.SimpleAccessLogger --simpleAccessLogger.format=combined --simpleAccessLogger.file=/var/log/$NAME/access.log" [10:03:37] which just change 'jenkins' to '$NAME' and name is set [10:03:41] so all good to me [10:04:13] jynus: thanks a ton :} [10:04:15] (03Merged) 10jenkins-bot: Switch udp2log destination to mwlog1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337560 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [10:04:24] there was one left about systemd , but I guess I am going to try it out on labs first [10:04:31] yes [10:04:34] (03CR) 10jenkins-bot: Switch udp2log destination to mwlog1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337560 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [10:04:37] thx jynus ! 
[10:05:52] !log filippo@tin Synchronized wmf-config/ProductionServices.php: Move udp2log from fluorine to mwlog1001 - T123728 (duration: 00m 41s) [10:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:58] T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728 [10:06:16] now that I have the spec merged, I can even test it locally :} [10:06:29] jenkins should compile into a catalogue without dependency cycles [10:06:30] error during compilation: Invalid parameter programname_compare on Systemd::Syslog[jenkins] [10:06:31] hehe [10:06:52] good [10:08:19] (03PS1) 10Giuseppe Lavagetto: prometheus::class_config: fix error in query building [puppet] - 10https://gerrit.wikimedia.org/r/339138 [10:09:28] godog, I can see logs flowing towards mwlog, as expected [10:09:42] (03CR) 10jerkins-bot: [V: 04-1] prometheus::class_config: fix error in query building [puppet] - 10https://gerrit.wikimedia.org/r/339138 (owner: 10Giuseppe Lavagetto) [10:10:06] jynus: yup, looks like it is working! [10:10:24] also mwlog -> fluorine relay is working afaict [10:10:44] do you know what is the architecture of that? [10:11:13] yeah a simple python3 script [10:11:20] oh, really? [10:11:25] (03PS2) 10Giuseppe Lavagetto: prometheus::class_config: fix error in query building [puppet] - 10https://gerrit.wikimedia.org/r/339138 [10:11:46] if you can point me to it, I can read it on my own [10:11:58] yeah it is called udpmirror.py in puppet [10:12:02] thanks [10:12:20] (03PS5) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) [10:12:22] interestingly not all mw hosts have switched, I still see udp2log traffic directly to fluorine [10:12:25] basically I fear at some point logs breaking [10:12:40] and having to understand that [10:13:07] is that a mediawiki config? 
[10:13:17] because there may be long-running processes [10:13:26] it is yeah, terbium still sends to fluorine but that's expected [10:13:29] with an old config [10:13:53] (03PS3) 10Giuseppe Lavagetto: prometheus::class_config: fix error in query building [puppet] - 10https://gerrit.wikimedia.org/r/339138 [10:13:55] even outside of terbium- I have to fight with them whenever I do a db topology change [10:14:09] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] prometheus::class_config: fix error in query building [puppet] - 10https://gerrit.wikimedia.org/r/339138 (owner: 10Giuseppe Lavagetto) [10:14:12] (dumps, wikidata dumps, long-running api queries, ...) [10:14:23] PROBLEM - carbon-local-relay metric drops on graphite2001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [10:15:23] RECOVERY - carbon-local-relay metric drops on graphite2001 is OK: OK: Less than 1.00% above the threshold [25.0] [10:17:33] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:17:34] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [10:19:15] jynus: indeed, is there manual intervention required or they'll just die off over time? [10:19:19] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046072 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1044.eqiad.wmnet'] ``` The... 
[10:20:05] godog- they decay over time, but check back in 10 minutes [10:20:10] or 30 [10:20:21] let me see for example if there are ongoing db activity [10:21:14] only a 51-minute terbium run [10:21:15] sigh, small mistake in udpmirror, fixing [10:21:22] oh [10:22:34] (03PS4) 10Filippo Giunchedi: uwsgi: parametrize service settings [puppet] - 10https://gerrit.wikimedia.org/r/338804 [10:22:36] (03PS4) 10Filippo Giunchedi: coal: disable uwsgi autoload [puppet] - 10https://gerrit.wikimedia.org/r/338805 [10:22:38] (03PS1) 10Filippo Giunchedi: udp2log: limit getaddrinfo results to SOCK_DGRAM in udpmirror [puppet] - 10https://gerrit.wikimedia.org/r/339140 [10:23:04] (03PS6) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) [10:24:26] (03PS3) 10Jcrespo: Revert "Add replication client grants to phuser" [puppet] - 10https://gerrit.wikimedia.org/r/335554 [10:25:04] (03CR) 10Filippo Giunchedi: [C: 032] udp2log: limit getaddrinfo results to SOCK_DGRAM in udpmirror [puppet] - 10https://gerrit.wikimedia.org/r/339140 (owner: 10Filippo Giunchedi) [10:26:01] (03PS4) 10Jcrespo: Revert "Add replication client grants to phuser" [puppet] - 10https://gerrit.wikimedia.org/r/335554 [10:26:13] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:28:03] (03CR) 10Elukey: "sanity check in https://puppet-compiler.wmflabs.org/5527/" [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [10:28:23] RECOVERY - Disk space on elastic1039 is OK: DISK OK [10:28:58] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046080 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1039.eqiad.wmnet'] ``` and were **ALL** successful. 
[10:29:43] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:31:13] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:31:43] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [10:32:11] (03CR) 10Jcrespo: [C: 032] Revert "Add replication client grants to phuser" [puppet] - 10https://gerrit.wikimedia.org/r/335554 (owner: 10Jcrespo) [10:32:51] (03PS7) 10Elukey: Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) [10:34:13] RECOVERY - puppet last run on prometheus1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:34:15] godog, elukey: another thing I noticed related to the switchover when I upgraded openssl; poolcounters in eqiad are running on jessie these days (poolcounter100[12]), while the ones in codfw (subra/suhail) are still on trusty. might be worth reimaging before the switchover as well? [10:34:54] moritzm: the poolcounters are still a bit of a mystery for me, but I can take care of the work :) [10:35:05] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3046111 (10fgiunchedi) >>! In T123728#3043456, @Ottomata wrote: > Just FYI, there is a Kafka based Monolog implementation in Mediawiki, currently used by the Discovery team for shipping some logs to... 
[10:35:43] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:36:16] moritzm: yeah should be straightforward enough, basically do the same thing we did in eqiad (one baremetal, one VM) and reimage as poolcounter200* [10:36:20] cc elukey ^ [10:37:32] ack [10:39:06] both subra and suhail are OOW since August 2015, we could also couple that with a server update [10:39:19] (for the non-ganeti host) [10:40:58] jynus: yeah looks like most non-terbium udp2log traffic has died off [10:41:03] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:41:20] (03CR) 10Elukey: [V: 032 C: 032] Move three codfw MW appservers to jobrunner/videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/338962 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [10:41:23] PROBLEM - carbon-local-relay metric drops on graphite2001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [10:41:26] I'll check back later and tomorrow [10:41:33] RECOVERY - Disk space on elastic1044 is OK: DISK OK [10:41:34] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [10:41:46] now onto that alert instead [10:42:33] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [10:42:49] !log reinstall mw211[89] as MW videoscalers (trusty) and mw2243 as MW jobrunner [10:42:50] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046133 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1044.eqiad.wmnet'] ``` and were **ALL** successful. 
[10:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:23] RECOVERY - carbon-local-relay metric drops on graphite2001 is OK: OK: Less than 1.00% above the threshold [25.0] [10:45:30] (03PS1) 10Filippo Giunchedi: wmnet: switch udplog CNAME to mwlog1001 [dns] - 10https://gerrit.wikimedia.org/r/339146 (https://phabricator.wikimedia.org/T123728) [10:45:50] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3046139 (10fgiunchedi) [10:46:13] (03CR) 10Filippo Giunchedi: [C: 032] wmnet: switch udplog CNAME to mwlog1001 [dns] - 10https://gerrit.wikimedia.org/r/339146 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [10:46:17] (03PS2) 10Filippo Giunchedi: wmnet: switch udplog CNAME to mwlog1001 [dns] - 10https://gerrit.wikimedia.org/r/339146 (https://phabricator.wikimedia.org/T123728) [10:47:03] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:50:11] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/334294 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [10:53:40] (03PS1) 10Jcrespo: mariadb-phabricator: Fix typo on grant (s/REPLICATON/REPLICATION/) [puppet] - 10https://gerrit.wikimedia.org/r/339147 [10:54:10] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(35|37|39|43|44).eqiad.wmnet [10:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:35] !log upgrading remaining mediawiki servers to HHVM 3.12.14 [10:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:12] (03PS2) 10Jcrespo: mariadb-phabricator: Fix typo on grant (s/REPLICATON/REPLICATION/) [puppet] - 10https://gerrit.wikimedia.org/r/339147 [10:55:34] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(45|46|47).eqiad.wmnet [10:55:38] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:18] (03PS2) 10Gehel: elasticsearch - reimage elastic10(45|46|47) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/339130 (https://phabricator.wikimedia.org/T151326) [10:56:20] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb-phabricator: Fix typo on grant (s/REPLICATON/REPLICATION/) [puppet] - 10https://gerrit.wikimedia.org/r/339147 (owner: 10Jcrespo) [11:00:46] (03PS3) 10Gehel: elasticsearch - reimage elastic10(45|46|47) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/339130 (https://phabricator.wikimedia.org/T151326) [11:01:49] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic10(45|46|47) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/339130 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [11:03:11] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046171 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1045.eqiad.wmnet'] ``` The... 
[11:05:39] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:08:05] (03PS1) 10Jcrespo: mariadb: Pool db1045 with low weight (1:10, compared to db1026) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339149 (https://phabricator.wikimedia.org/T147747) [11:10:08] (03PS1) 10Giuseppe Lavagetto: prometheus::class_config: properly escape and build the puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/339150 [11:21:39] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:22:24] this one was me with wmf-reimage --^ [11:26:41] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046228 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1045.eqiad.wmnet'] ``` and were **ALL** successful. [11:29:20] elukey: that check could be improved a bit [11:29:47] volans: yep.. [11:30:42] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:31:48] (03PS2) 10Giuseppe Lavagetto: prometheus::class_config: properly escape and build the puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/339150 [11:33:15] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046235 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1046.eqiad.wmnet'] ``` The... 
[11:35:42] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:36:52] .erb is a drama :( [11:38:21] (03PS1) 10Elukey: Abort wmf-auto-reimage with empty IPMI_PASSWORD [puppet] - 10https://gerrit.wikimedia.org/r/339156 [11:40:06] basically https://gfycat.com/LameMaleGuanaco [11:40:30] (03CR) 10Giuseppe Lavagetto: [C: 032] prometheus::class_config: properly escape and build the puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/339150 (owner: 10Giuseppe Lavagetto) [11:40:32] PROBLEM - carbon-local-relay metric drops on graphite2001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [11:42:10] godog: ahahah [11:42:32] godog: exactly what happened to me [11:42:46] when Riccardo told me what was the problem [11:43:05] lol [11:43:26] yep, you'd think you fixed it and walk away, but no [11:43:32] RECOVERY - carbon-local-relay metric drops on graphite2001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:46:29] now Giuseppe told another thing, namely that I forgot to change partman recipe for the new videoscalers [11:46:42] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:46:42] (03PS1) 10Muehlenhoff: Readd base::firewall to multatuli [puppet] - 10https://gerrit.wikimedia.org/r/339158 [11:46:48] so I'd probably just need to flip some tables around to make up with my karma [11:47:02] (03PS2) 10Muehlenhoff: Readd base::firewall to multatuli [puppet] - 10https://gerrit.wikimedia.org/r/339158 [11:48:01] !log upgrading labmon1001 to grafana 4.1 [11:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:44] (03CR) 10Muehlenhoff: [C: 032] Readd base::firewall to multatuli [puppet] - 10https://gerrit.wikimedia.org/r/339158 (owner: 10Muehlenhoff) [11:55:22] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg 
reports broken packages [11:55:54] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1045 with low weight (1:10, compared to db1026) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339149 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [11:57:11] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/5530/" [puppet] - 10https://gerrit.wikimedia.org/r/338804 (owner: 10Filippo Giunchedi) [11:57:22] (03Merged) 10jenkins-bot: mariadb: Pool db1045 with low weight (1:10, compared to db1026) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339149 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [11:57:32] (03CR) 10jenkins-bot: mariadb: Pool db1045 with low weight (1:10, compared to db1026) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339149 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [11:57:47] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046269 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1046.eqiad.wmnet'] ``` and were **ALL** successful. [11:57:48] any volunteers/takers for https://gerrit.wikimedia.org/r/#/c/338804 and https://gerrit.wikimedia.org/r/#/c/338805/ ? shouldn't be impactful to non-affected uwsgi instances [11:58:43] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [11:58:53] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:00:25] (03CR) 10Jcrespo: [C: 031] uwsgi: parametrize service settings [puppet] - 10https://gerrit.wikimedia.org/r/338804 (owner: 10Filippo Giunchedi) [12:00:51] I am ok with the first one, I do not know enough to know what is the impact of the second [12:00:53] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:01:03] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python-memcache],Package[python-sqlalchemy],Package[graphite-carbon] [12:02:01] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046275 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1047.eqiad.wmnet'] ``` The... [12:03:17] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(45|46).eqiad.wmnet [12:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:31] (03CR) 10Volans: [C: 031] "LGTM, thanks for adding this" [puppet] - 10https://gerrit.wikimedia.org/r/339156 (owner: 10Elukey) [12:05:07] jynus: the second is essentially to fix coal on graphite1001 [12:05:33] (03PS2) 10Elukey: Abort wmf-auto-reimage with empty IPMI_PASSWORD [puppet] - 10https://gerrit.wikimedia.org/r/339156 [12:05:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1045 with low load (duration: 02m 49s) [12:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:03] thanks though! 
[12:06:56] (03CR) 10Elukey: [V: 032 C: 032] Abort wmf-auto-reimage with empty IPMI_PASSWORD [puppet] - 10https://gerrit.wikimedia.org/r/339156 (owner: 10Elukey) [12:07:13] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:23] RECOVERY - DPKG on labmon1001 is OK: All packages OK [12:08:05] (03CR) 10Giuseppe Lavagetto: [C: 032] tlsproxy: add nginx_bootstrap define [puppet] - 10https://gerrit.wikimedia.org/r/333247 (owner: 10Filippo Giunchedi) [12:09:36] (03CR) 10Giuseppe Lavagetto: [C: 032] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [12:17:11] (03PS1) 10Elukey: Change partman recipe for new MW codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/339166 (https://phabricator.wikimedia.org/T156023) [12:18:47] !log rebuild of translation memories index is done [12:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:06] I am going to retry the scap, too many errors [12:20:34] are servers being installed/reimaged? [12:20:48] "mw2119.codfw.wmnet returned [127]: bash: /usr/bin/scap: No such file or directory" [12:21:00] and 58 others timing out [12:21:20] jynus: lots of elasticsearch reimages (but you probably don't care about those) [12:21:33] volans restarted salt this morning, might be related [12:21:48] jynus: I am reimaging 3 in codfw, mw211[89] and mw2243, but they should be depooled and out of codfw [12:21:54] no [12:22:00] but 3 I would understand [12:22:01] err out of dsh [12:22:04] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1045 with low load (again) (duration: 02m 47s) [12:22:07] there are 57 failures [12:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:11] 58 [12:22:23] pull failed [12:22:30] all in codfw or also in eqiad? 
[12:22:42] jynus: looking [12:23:03] all in codfw [12:23:22] interesting.. this is new [12:23:34] I think those are new [12:23:44] but why are they on the pool of sync? [12:23:46] they are not connected to salt, but they were after the restart [12:24:23] but salt should not be touched for doing an rsync [12:24:58] any specific cluster/type of host? Yesterday I moved all the mw2* jobrunners/videoscalers/imagescalers from conf-tool eqiad to codfw [12:25:44] any other to test? mw2119 is not reachable by salt [12:26:02] oh but is the one reimaged by elukey, nevermind [12:26:09] ignore me :) [12:26:37] mw2119 [12:26:59] maybe it is a proxy what it is down? [12:27:35] ouch you are completely right [12:27:45] jynus: no ignore what I said, mw2119 the one reimaged by elukey that got stuck, is expected [12:27:53] no no jynus is right [12:28:06] I missed the fact that mw2119 is a proxy [12:28:10] ahhh is a proxy for scap [12:28:10] ok [12:28:15] problem solved [12:28:24] let's take it out temporarelly [12:28:34] so 50 don't fail [12:28:48] of if 2119 is back soon [12:28:51] I can wait [12:29:33] it will be a videoscaler and I'll need to reimage it again, so maybe we could elect mw2117 as new proxy [12:29:47] ok, let's do a CR [12:29:52] going to do it [12:30:52] checking where mw2117 is before [12:30:55] should be row b [12:31:14] I was editing, but I will stop [12:31:25] if you are already on it [12:31:50] as you prefer, it will take me a min [12:31:59] B3 [12:32:02] (mw2117 is in B3, should be ok) [12:32:06] if racktables is correct [12:32:54] lldpcli show neighbors | grep SysName shows asw-b-codfw [12:34:16] it was just a comment [12:34:18] (03PS1) 10Elukey: Replace mw2119 with mw2117 as scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/339167 (https://phabricator.wikimedia.org/T156023) [12:34:23] don't take me too seriously [12:34:43] oh yes yes I only double checked to be sure! 
[12:35:00] (03CR) 10Jcrespo: [C: 031] Replace mw2119 with mw2117 as scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/339167 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [12:35:23] (03CR) 10Elukey: [V: 032 C: 032] Replace mw2119 with mw2117 as scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/339167 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [12:35:24] I've looped in giu in case he has something to say as a bad selection [12:36:13] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:36:33] all right, merging and running puppet on mw2117 and tin [12:37:29] looking good [12:37:45] 06Operations: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757#3046345 (10Volans) [12:38:14] 06Operations: Internal PKI for secure communication - Barcelona Ops offsite 2016 - https://phabricator.wikimedia.org/T150822#2797805 (10Volans) Related issue with the current Puppet certificates: T158757 [12:38:27] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1045 with low load (3rd time a charm) (duration: 00m 39s) [12:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:40] looks good, no errors [12:38:51] \o/ [12:38:56] sorry for the trouble jynus [12:39:05] oh, no trouble at all [12:39:22] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3046382 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1047.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1047.eqi... 
[12:39:24] my changes are mostly idempotent- can be applied slowly and several times [12:40:02] and better me to find it rather than non-ops [12:40:09] I usually do git grep for all the hosts that I change to prevent these issues, but this time it was a PEBKAC probably [12:40:15] yeah :( [12:41:00] marostegui, db1001 replication down, is that you= [12:41:04] (03PS2) 10Elukey: Change partman recipe for new MW codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/339166 (https://phabricator.wikimedia.org/T156023) [12:41:08] nope [12:41:09] checking [12:41:21] 1001? [12:41:36] several servers in fact [12:41:55] checking 1001 [12:42:25] bacula related? :? [12:42:43] not bacula [12:42:51] bacula doesn't access those servers [12:43:05] 1001 and 2010 have replication broken because of bacula trying to create a table [12:43:19] * an index [12:43:22] oh [12:43:24] bacula [12:43:27] as a client [12:43:37] sorry, yes [12:43:37] not as a service [12:43:41] related to bacula database [12:43:44] I was like, makes no sense? [12:43:48] XDD [12:43:52] yeah, sorry I wasn't clear [12:43:54] ok, we should fix that [12:43:58] not worries me much [12:44:14] what about db1069:3312 ? [12:44:28] that looks related to the pt-table-checksum [12:44:43] ok, makes sense [12:44:43] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [12:44:48] can you give a look at that [12:45:01] ^ ignore the hhvm alert, update [12:45:01] I will take care of m1 or m2, wherever is bacula [12:45:08] m1 [12:45:13] i will fix db1069 [12:46:01] has any one run an upgrade script or something on bacula? [12:46:41] this is yesterday at 22:51 [12:48:27] there is a missing replication check there, maybe due to topology changes [12:50:07] we haven't changed m1 recently, no? 
[12:50:25] (03CR) 10Elukey: [C: 032] Change partman recipe for new MW codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/339166 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [12:50:25] not in a long time [12:50:58] and there is not such a table on the master [12:51:13] so it is not like an accidental drop or anything [12:51:19] :o [12:51:23] manually done? [12:51:37] we can check the logs [12:52:17] when replication fails on all slaves, it is normally the master's fault [12:52:25] 06Operations, 07discovery-system: confctl SubjectAltNameWarning after python-urllib3 upgrade - https://phabricator.wikimedia.org/T156232#2968185 (10Volans) The main issue is tracked in T158757. For conftool the temporary solution is to ignore the warning: ``` from requests.packages.urllib3.exceptions import S... [12:55:53] PROBLEM - MegaRAID on db1049 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [12:55:54] ACKNOWLEDGEMENT - MegaRAID on db1049 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T158761 [12:56:03] 06Operations, 10ops-eqiad: Degraded RAID on db1049 - https://phabricator.wikimedia.org/T158761#3046463 (10ops-monitoring-bot) [12:56:04] :( [12:56:15] we have disks for that one :-) [12:56:18] marostegui: stop breaking DB's disks :D [12:56:32] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1049 - https://phabricator.wikimedia.org/T158761#3046470 (10Marostegui) [12:56:34] it wasn't me I promise!! 
[12:56:47] I am going to skip that event, marostegui [12:56:54] jynus: ok [12:57:29] ah [12:57:33] I see now [12:57:41] "CREATE TEMPORARY TABLE DelCandidates" [12:58:00] temporary tables and replication do not get along very well [12:58:42] good news is that I think they are not used for storing real data [12:58:47] so it should be safe [13:00:31] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1049 - https://phabricator.wikimedia.org/T158761#3046481 (10Marostegui) p:05Triage>03High a:03Cmjohnson This is correct, that disk is broken: ``` Enclosure Device ID: 32 Slot Number: 4 Drive's position: DiskGroup: 0, Span: 2, Arm: 0 Enclosure positio... [13:00:48] ^ set it to high because it is s5 master [13:02:12] comment that on the ticket itself [13:02:21] I did already :) [13:02:26] ah, thanks [13:02:38] I have ignored bacula.DelCandidates [13:02:42] ok :) [13:02:50] but we may have to reconstruct the slaves [13:03:32] or [13:03:38] run pt-table-checsum there [13:03:47] we do not care about lag on those slaves [13:04:07] we also have to enable the replication check as non-critical [13:04:21] I may create a ticket to write all of that down [13:12:47] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:13:17] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:07] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 73595 bytes in 0.071 second response time [13:17:04] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339172 (https://phabricator.wikimedia.org/T158762) [13:19:37] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:19:44] (03PS2) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339172 (https://phabricator.wikimedia.org/T158762) [13:19:57] PROBLEM - puppet last run on mw1230 is CRITICAL: 
CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [13:20:09] this is Moritz upgrading hhvm [13:20:27] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.054 second response time [13:23:12] ack, these are depooled [13:25:43] (03PS3) 10DCausse: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) [13:26:53] we should integrate the 2 things [13:27:13] (03PS4) 10DCausse: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) [13:27:15] so that alerting is aware of the pooled state #ideal-world :) [13:28:07] jouncebot: next [13:28:08] In 0 hour(s) and 31 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T1400) [13:29:20] (03CR) 10jerkins-bot: [V: 04-1] Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [13:33:23] (03PS5) 10DCausse: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) [13:33:32] ACKNOWLEDGEMENT - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 174276 Ema Keeping an eye on it [13:39:52] (03CR) 10Volans: [C: 031] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/338985 (owner: 10Giuseppe Lavagetto) [13:40:57] PROBLEM - DPKG on mw1286 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:41:57] RECOVERY - DPKG on mw1286 is OK: All packages OK [13:44:17] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:44:18] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:18] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:18] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:27] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:28] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:37] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:37] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:37] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:47] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:45:07] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:45:08] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:45:08] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:45:08] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:45:17] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:45:17] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:45:27] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:45:27] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:45:27] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:45:37] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [13:45:44] ^ checking [13:46:52] I guess it was overloaded for a bit with the backups (as they are running now) [13:47:57] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:48:56] (03CR) 10Volans: [C: 031] "LGMT, I think it's ready for a real test in labs! 
:)" [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [13:52:06] (03CR) 10Giuseppe Lavagetto: [C: 032] Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [13:52:16] (03CR) 10Giuseppe Lavagetto: [C: 032] Only output "changed" values if actually changed [software/conftool] - 10https://gerrit.wikimedia.org/r/338985 (owner: 10Giuseppe Lavagetto) [13:55:07] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T1400). [14:00:05] kart_ and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:20] * kart_ here. [14:01:18] o/ [14:02:42] zeljkof: SWAT'ng? [14:03:00] waiting for somebody to say they will swat today ;) [14:03:04] anybody? [14:04:19] (03Merged) 10jenkins-bot: Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [14:04:24] ok, it has been a while since I've swatted, so... [14:04:30] I can SWAT today! [14:04:46] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic1047.eqiad.wmnet [14:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:56] Urbanecm: around for swat? [14:05:03] zeljkof: I've special requirement for SWAT. See deployment calendar :) [14:05:22] kart_: looking... 
[14:05:41] (03PS2) 10Zfilipin: Deploy Compact Language Links in Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338790 (https://phabricator.wikimedia.org/T157114) (owner: 10KartikMistry) [14:06:16] ok, kart_ is the first, rebasing your patch, will merge, deploy and run the script [14:06:55] kart_: I guess that can not be tested at mwdebug1002? or can it? [14:07:13] I mean, before the deployment [14:07:49] zeljkof: script? no. [14:07:59] zeljkof: but it can be run as dry-run. [14:08:12] and then once it is okay, we can add --really flag to run. [14:08:17] kart_: should I run it as dry-run? [14:08:26] zeljkof: default it dry-run. [14:08:42] is* [14:09:08] ok [14:09:18] it has been a while since I have deployed [14:09:32] kart_: do you want to deploy? [14:09:44] do you do deployments? [14:10:02] zeljkof: used to do long time back :/ [14:10:02] I mean, there is no need for me to do it, if you can do it [14:10:16] ok, in that case, I should do it' [14:10:17] ? [14:10:18] (03PS9) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [14:10:21] o/ [14:10:25] zeljkof: That'll be better. [14:10:25] (03PS1) 10Hashar: systemd: update class name in a fail() error [puppet] - 10https://gerrit.wikimedia.org/r/339174 [14:10:27] sorry went busy with some other errand [14:10:27] (03PS1) 10Hashar: rubocop disable Style/PercentLiteralDelimiters [puppet] - 10https://gerrit.wikimedia.org/r/339175 [14:10:29] (03PS1) 10Hashar: systemd: add spec [puppet] - 10https://gerrit.wikimedia.org/r/339176 [14:10:33] our savior hashar is here [14:10:34] (03PS10) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [14:10:35] kart_: ok, will do [14:10:40] hashar: want to do swat? 
[14:10:49] I have rebased one change so far [14:10:53] (03PS1) 10Volans: Match the whole string for hosts regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/339177 (https://phabricator.wikimedia.org/T158746) [14:10:53] just do it ? [14:11:01] hashar: sure, just asking [14:11:10] it is part of our responsibility, so no point in spending ages figuring out who would do :} [14:11:12] I'm here, didn't noticed the ping... [14:11:21] zeljkof, ^ [14:11:23] I am around to assist as needed [14:11:27] Urbanecm: ok, stay tuned, you are next [14:11:30] Ok. [14:11:47] the throttle one is straight forward [14:11:47] hashar: thanks, might need help with the script, I rarely do that, but will probably figure it out [14:12:05] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338790 (https://phabricator.wikimedia.org/T157114) (owner: 10KartikMistry) [14:12:43] (03CR) 10jerkins-bot: [V: 04-1] systemd: add spec [puppet] - 10https://gerrit.wikimedia.org/r/339176 (owner: 10Hashar) [14:12:45] (03CR) 10Hashar: [C: 031] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339172 (https://phabricator.wikimedia.org/T158762) (owner: 10Urbanecm) [14:13:07] zeljkof: if you want to smoke test the throttle change, I usually head to mwdebug1001 and attempt to login [14:13:29] but really that one is just all fine and you can push it directly to prod [14:13:39] the few issues those rules might have are supposedly 100% covered by tests [14:14:15] I have done plenty of those, no problems so far, as far as I know [14:14:37] (03Merged) 10jenkins-bot: Deploy Compact Language Links in Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338790 (https://phabricator.wikimedia.org/T157114) (owner: 10KartikMistry) [14:14:42] for "Compact Language Links" I don't know what should be the sequence [14:14:56] hashar: deploy patch + run script. [14:15:00] kart_: should we run the script before it is deployed? 
[14:15:04] (03CR) 10jenkins-bot: Deploy Compact Language Links in Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338790 (https://phabricator.wikimedia.org/T157114) (owner: 10KartikMistry) [14:15:09] hashar: after [14:15:11] k :} [14:15:16] so sync the dblist [14:15:23] then run on terbium I guess [14:16:26] yes [14:17:04] !log temporary raising high/low watermarks on elasticsearch eqiad to allow allocation of all shards [14:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:11] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/339176 (owner: 10Hashar) [14:17:31] !log zfilipin@tin Synchronized dblists/compact-language-links.dblist: SWAT: [[gerrit:338790|Deploy Compact Language Links in Swedish Wikipedia (T157114)]] (duration: 00m 50s) [14:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:36] T157114: Deploy Compact Language Links in Swedish Wikipedia - https://phabricator.wikimedia.org/T157114 [14:17:39] (03PS3) 10Zfilipin: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339172 (https://phabricator.wikimedia.org/T158762) (owner: 10Urbanecm) [14:17:53] !log Nuked Jenkins workspaces for the job operations-puppet-typos [14:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:10] kart_: patch deployed, running script... [14:18:17] 7 Failed connecting to redis server at rdb1007.eqiad.wmnet: Bad file descriptor in /srv/mediawiki/php-1.29.0-wmf.12/includes/libs/redis/RedisConnectionPool.php on line 235 [14:18:17] bah [14:18:21] tant is unrelated to the swat [14:18:25] that [14:20:08] zeljkof: deployment of cll is Okay. Let me know when script is done. [14:20:36] zeljkof: also do you've live 'screen' to see output? [14:20:58] kart_: sorry, what do I need to run? [14:21:11] exactly the code from here? 
https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T1400 [14:21:26] (I run scripts rarely, maybe once or twice so far) [14:21:36] zeljkof: as mention in the page. extensions/UniversalLanguageSelector/maintenance/ULSCompactLinksDisablePref.php --wiki svwiki [14:21:41] from wmf12 branch [14:21:54] that is the part I have missed [14:21:59] and then see output. This will be dryrun [14:22:03] (03PS2) 10Gehel: elasticsearch: don't send logs to the console [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) [14:22:18] and if you can paste that output somewhere, that will be nice. [14:22:52] there might be quite a lot of output, might want to capture that in a file [14:22:52] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664#3046694 (10Gehel) After discussion with @dcausse, it seems to be a better idea to not send any logs to the console, so as to not mi... [14:23:03] Nikerabbit: oh. right. [14:23:08] kart_: will do, to the task [14:23:37] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:23:43] zeljkof: private pastebin is fine too. [14:23:57] kart_: I prefer phab, is that a problem? [14:24:04] should it be private? [14:24:07] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:24:10] Nikerabbit: is that fine? [14:24:58] zeljkof: see amount of output and decide. No worries I guess. 
[14:25:42] (03PS1) 10Volans: Allow to ignore selected urllib3 warnings [software/cumin] - 10https://gerrit.wikimedia.org/r/339179 (https://phabricator.wikimedia.org/T158758) [14:25:51] zeljkof: you need to use mwscript --wiki svwiki [14:25:57] zeljkof: forgot to add that :) [14:26:21] kart_: thanks, I have figured it out, just about to ask if that is what I need to do :) [14:26:35] only user id's will be printed by the script [14:26:42] I think those are safe [14:26:58] kart_: so to make sure, this is exactly what I need to do? [14:27:03] zfilipin@terbium:/srv/mediawiki-staging/php-1.29.0-wmf.12$ mwscript extensions/UniversalLanguageSelector/maintenance/ULSCompactLinksDisablePref.php --wiki svwiki [14:27:23] yep [14:27:37] running [14:27:49] any estimate on how long it could take? [14:27:52] ok, done :) [14:27:56] Let me know once output is done. [14:27:58] (03PS1) 10Volans: Cumin: disable urllib3 SubjectAltNameWarning [puppet] - 10https://gerrit.wikimedia.org/r/339180 (https://phabricator.wikimedia.org/T158758) [14:28:06] zeljkof: can you paste output somewhere? [14:28:56] kart_: https://phabricator.wikimedia.org/T157114#3046718 [14:29:58] kart_: let me know if it looks good and if I can run it for real [14:30:18] !log resetting to usual values for low/high watermark on elasticsearch eqiad (75% / 80%) [14:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:16] zeljkof: give me a minute or two [14:31:22] sure [14:33:24] zeljkof: go ahead. [14:33:35] we [14:33:40] are good* [14:34:45] kart_: sorry, can not find what I need to add to the end [14:34:53] --for-realz-now-plz? [14:34:54] ;) [14:35:10] found it [14:35:10] --really [14:35:11] --really [14:35:31] so: zfilipin@terbium:/srv/mediawiki-staging/php-1.29.0-wmf.12$ mwscript extensions/UniversalLanguageSelector/maintenance/ULSCompactLinksDisablePref.php --wiki svwiki --really [14:35:39] kart_: correct? 
^ [14:35:47] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:35:50] yes [14:35:54] ok, running [14:36:38] (03PS1) 10Marostegui: realm.pp: Add __wmf_checksum table to be ignored [puppet] - 10https://gerrit.wikimedia.org/r/339182 (https://phabricator.wikimedia.org/T154485) [14:37:37] (03PS2) 10Marostegui: realm.pp: Add __wmf_checksums table to be ignored [puppet] - 10https://gerrit.wikimedia.org/r/339182 (https://phabricator.wikimedia.org/T154485) [14:38:15] kart_: done, will paste the output [14:39:06] kart_: https://phabricator.wikimedia.org/T157114#3046756 [14:39:14] thanks for flying with #releng [14:39:23] Urbanecm: your patch is next [14:39:30] zeljkof: thanks! [14:39:32] zeljkof, okay, around [14:39:45] zeljkof: can you do a favour? [14:39:52] Urbanecm: there is nothing for you to do, right? I just deploy ans we pray? ;) [14:39:55] zeljkof: can you delete output from the tast? [14:39:58] both [14:40:05] kart_: sure, will do [14:40:14] kart_: just delete my commetns? [14:40:19] zeljkof: yep [14:40:53] zeljkof: other also :) [14:41:01] kart_: done, please check if you can see them [14:41:05] cool. Thanks. 
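(Editor's note) The exchange above runs the maintenance script once as a dry run, reviews the output, and only then repeats it with --really. A minimal sketch of that "dry run by default" pattern follows. It is not the code of ULSCompactLinksDisablePref.php; the function names, the report wording, and the apply_change callback are all hypothetical, chosen only to illustrate the idea.

```python
import argparse


def build_parser():
    # Hypothetical command-line interface: --wiki selects the target wiki,
    # --really must be passed explicitly before anything is modified.
    parser = argparse.ArgumentParser(description="Illustrative maintenance task")
    parser.add_argument("--wiki", required=True)
    parser.add_argument("--really", action="store_true",
                        help="apply changes; without it, only report")
    return parser


def run(user_ids, really, apply_change):
    """Report each affected user ID; call apply_change only when really is True."""
    report = []
    for user_id in user_ids:
        if really:
            apply_change(user_id)
            report.append(f"updated {user_id}")
        else:
            report.append(f"would update {user_id}")
    return report
```

With this shape, the first invocation prints only "would update ..." lines, which can be pasted somewhere for review before the same command is repeated with --really.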
[14:41:40] zeljkof, I don't know anything :) [14:42:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339172 (https://phabricator.wikimedia.org/T158762) (owner: 10Urbanecm) [14:43:28] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339172 (https://phabricator.wikimedia.org/T158762) (owner: 10Urbanecm) [14:43:54] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339172 (https://phabricator.wikimedia.org/T158762) (owner: 10Urbanecm) [14:46:10] ooooh its swat time :O [14:46:24] (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/5531/" [puppet] - 10https://gerrit.wikimedia.org/r/339182 (https://phabricator.wikimedia.org/T154485) (owner: 10Marostegui) [14:46:36] addshore: almost done [14:47:07] oh, looks like I don't actually have anything to add! [14:47:23] (03PS1) 10Volans: Cumin: authorize also cumin masters IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) [14:48:49] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (10dcausse) Comparing logs available on mwlog1001 I see that the following are missing: - analysis (empty directory created in 2015) - apache2.log - hhvm.log (used by fatalmonitor) - memcache... [14:49:03] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:339172|New throttle rule (T158762)]] (duration: 00m 41s) [14:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:09] T158762: Lift IP rate limit - Editathon - 2017-03-11 - https://phabricator.wikimedia.org/T158762 [14:49:22] Urbanecm: deployed, thanks for flying with #releng [14:50:23] ;} [14:50:33] addshore: need something to be deployed? 
[14:50:42] !log finished EU SWAT [14:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:30] 06Operations, 10Traffic, 07Mobile: Samsung Internet's desktop mode getting redirected to mobile site - https://phabricator.wikimedia.org/T158599#3046781 (10ema) p:05Triage>03Normal [14:51:37] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:54:58] zeljkof, thank you! [14:55:31] (03CR) 10Hashar: [C: 031] "Definitely a noop for puppet compilations. That is solely for CI / local testing." [puppet] - 10https://gerrit.wikimedia.org/r/339175 (owner: 10Hashar) [14:56:50] (03CR) 10Jcrespo: [C: 031] realm.pp: Add __wmf_checksums table to be ignored [puppet] - 10https://gerrit.wikimedia.org/r/339182 (https://phabricator.wikimedia.org/T154485) (owner: 10Marostegui) [14:57:28] (03CR) 10Marostegui: [C: 032] realm.pp: Add __wmf_checksums table to be ignored [puppet] - 10https://gerrit.wikimedia.org/r/339182 (https://phabricator.wikimedia.org/T154485) (owner: 10Marostegui) [14:57:55] (03CR) 10Hashar: "My aim here is to have some basic spec coverage of the systemd defines, specially exercising the .erb templates. My motivation is change" [puppet] - 10https://gerrit.wikimedia.org/r/339176 (owner: 10Hashar) [15:00:32] hashar: no, I thought I had a ticket, but it turns out it has not yet been created! [15:03:47] RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:05:47] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:06:21] (03PS5) 10Filippo Giunchedi: tlsproxy: add nginx_bootstrap define [puppet] - 10https://gerrit.wikimedia.org/r/333247 [15:09:43] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3046822 (10Cmjohnson) 05Open>03Resolved The disk has been swapped...ready for re-install----resolving this task [15:09:46] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3046824 (10Cmjohnson) [15:09:49] (03PS4) 10Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 [15:10:26] (03CR) 10Hashar: "Rebased on top of https://gerrit.wikimedia.org/r/#/c/339176/ which introduce spec tests for the systemd module. Added a spec to assert th" [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [15:11:17] (03PS1) 10Jcrespo: mariadb: Depool db1026 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339187 (https://phabricator.wikimedia.org/T147747) [15:11:25] (03CR) 10jerkins-bot: [V: 04-1] systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [15:11:32] (03PS12) 10Filippo Giunchedi: swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [15:11:57] !log Restart MySQL on db1069 to apply new replication filters - https://phabricator.wikimedia.org/T154485 [15:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:00] (03PS5) 10Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 [15:13:14] (03CR) 10Hashar: "Fixed rubocop" [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [15:13:41] 06Operations, 10ops-eqiad, 06DC-Ops: 
1624-Power Supply Unplugged - Power Supply 1 is unplugged - elastic1043 - https://phabricator.wikimedia.org/T158749#3046827 (10Cmjohnson) 05Open>03Resolved The psu cable was not locked in place. Fixed. - Resolving [15:14:33] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1026 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339187 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [15:14:56] (03CR) 10jenkins-bot: mariadb: Depool db1026 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339187 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [15:16:46] 06Operations, 10media-storage, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Enable HTTPS for Swift traffic - https://phabricator.wikimedia.org/T127455#3046830 (10fgiunchedi) [15:16:59] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1026 for maintenance (duration: 00m 41s) [15:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:30] (03PS1) 10Filippo Giunchedi: hieradata: use_tls for swift proxy in codfw [puppet] - 10https://gerrit.wikimedia.org/r/339191 (https://phabricator.wikimedia.org/T127455) [15:19:47] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:21:46] !log Restart MySQL on db1095 to apply new replication filters - https://phabricator.wikimedia.org/T154485 [15:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:17] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:24:34] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/5534/" [puppet] - 10https://gerrit.wikimedia.org/r/339191 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [15:28:37] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/ms-fe.svc.codfw.wmnet.crt] [15:29:09] (03PS1) 10Filippo Giunchedi: Use .crt for ms-fe.svc certs [puppet] - 10https://gerrit.wikimedia.org/r/339192 [15:30:03] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Use .crt for ms-fe.svc certs [puppet] - 10https://gerrit.wikimedia.org/r/339192 (owner: 10Filippo Giunchedi) [15:30:37] PROBLEM - puppet last run on ms-fe2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/ms-fe.svc.codfw.wmnet.crt] [15:30:48] puppet compiler + fileserver = not always joy [15:32:35] 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3046851 (10Cmjohnson) I confused this server with something else. This server has 12 SSDs that we purchased and placed in the system in December 2016. Strange that one failed aft... 
[15:32:37] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:33:37] RECOVERY - puppet last run on ms-fe2005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:33:47] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:34:38] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3046855 (10elukey) First step of the MW rebalancing done. This is... [15:38:35] (03PS2) 10Volans: Cumin: authorize also cumin masters IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) [15:40:18] (03PS11) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [15:42:27] (03CR) 10DCausse: elasticsearch: don't send logs to the console (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) (owner: 10Gehel) [15:42:59] (03PS12) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [15:43:16] (03PS1) 10Giuseppe Lavagetto: Syncer: better logging [software/conftool] - 10https://gerrit.wikimedia.org/r/339194 [15:43:18] (03PS1) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/339195 [15:43:37] !log stopping mariadb replication on db1026 for maintenance T147747 [15:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:23] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 (owner: 10Paladox) [15:44:26] (03CR) 10Volans: "Puppet compiler diffs available at: 
https://puppet-compiler.wmflabs.org/5535/" [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) (owner: 10Volans) [15:45:41] (03PS3) 10Gehel: elasticsearch: don't send logs to the console [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) [15:46:03] (03CR) 10Gehel: "Thanks David (/me needs a new brain)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) (owner: 10Gehel) [15:47:47] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:47:59] (03CR) 10DCausse: [C: 031] elasticsearch: don't send logs to the console [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) (owner: 10Gehel) [15:48:06] !log installing tcpdump security updates on ubuntu systems (jessie already fixed for a while) [15:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:13] (03PS13) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [15:50:08] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 200262 [15:50:27] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:50:40] hi im getting this error [15:50:41] Feb 22 15:50:02 phabricator ferm[12259]: Starting Firewall: fermError in /etc/ferm/conf.d/10_ssh-from-cumin-masters line 4: [15:50:42] Feb 22 15:50:02 phabricator ferm[12259]: & R_SERVICE [15:50:42] Feb 22 15:50:02 phabricator ferm[12259]: ( [15:50:42] Feb 22 15:50:02 phabricator ferm[12259]: tcp , 22 , $ CUMIN_MASTERS <-- [15:50:42] Feb 22 15:50:02 phabricator ferm[12259]: no such variable: $CUMIN_MASTERS [15:50:43] Feb 22 15:50:02 phabricator ferm[12259]: failed! 
[15:50:44] Feb 22 15:50:02 phabricator systemd[1]: ferm.service: control process exited, code=exited status=25 [15:50:45] Feb 22 15:50:02 phabricator systemd[1]: Failed to start LSB: ferm firewall configuration. [15:50:46] Feb 22 15:50:02 phabricator systemd[1]: Unit ferm.service entered failed state. [15:50:47] for ferm. [15:50:57] paladox: where? [15:51:03] that file i have not touched. [15:51:04] volans ferm [15:51:08] which host [15:51:08] it fails to start, failed puppet. [15:51:16] phabricator (labs instance) [15:51:27] labs? [15:51:34] ACKNOWLEDGEMENT - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 204220 Ema Checking [15:51:34] ACKNOWLEDGEMENT - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 200262 Ema Checking [15:51:40] let me double check [15:51:43] volans yeh, it wasent failing yesturday [15:52:44] CUMIN_MASTERS isn't defined when running in labs [15:52:44] CUMIN_MASTERS was not defined for labs realm, because it doesn't apply [15:52:55] which role is it running paladox ? [15:53:03] volans the phabricator role [15:53:04] moritzm: will ferm complain with an empty array? 
[15:53:17] ferm complains for almost everything :-) [15:53:28] encouraging :) [15:53:48] I think checking the realm and only applying on production would be best [15:53:58] until we have cumin for labs [15:54:04] I was thinking the same, but we apply only in standard [15:54:10] (03PS7) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [15:54:20] the role phabricator::main should not include standard [15:54:22] a few roles include standard IIRC [15:54:33] on site.pp is explicitely added afterwards [15:54:37] PROBLEM - DPKG on labvirt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:54:43] s/on/in/ [15:54:53] paladox: I'll send a fix in few minutes [15:54:59] volans thanks :) [15:55:03] thanks for notifying me [15:55:20] Your welcome :) [15:55:49] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1049 - https://phabricator.wikimedia.org/T158761#3046882 (10Cmjohnson) The disk has been swapped and is rebuilding Enclosure Device ID: 32 Slot Number: 4 Drive's position: DiskGroup: 0, Span: 2, Arm: 0 Enclosure position: N/A Device Id: 4 WWN: 5000C5005E8... [15:56:18] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1049 - https://phabricator.wikimedia.org/T158761#3046883 (10Marostegui) Thanks Chris, I will keep an eye on it and close the ticket once it is finished! [15:56:31] (03PS1) 10Filippo Giunchedi: lvs: add swift https service [puppet] - 10https://gerrit.wikimedia.org/r/339197 (https://phabricator.wikimedia.org/T127455) [15:56:37] RECOVERY - DPKG on labvirt1001 is OK: All packages OK [15:59:53] (03CR) 10Paladox: [C: 031] "Tested on the puppetmaster and works. This should not have any impact on prod." 
[puppet] - 10https://gerrit.wikimedia.org/r/338294 (owner: 10Paladox) [16:01:37] (03PS2) 10Filippo Giunchedi: lvs: add swift https service [puppet] - 10https://gerrit.wikimedia.org/r/339197 (https://phabricator.wikimedia.org/T127455) [16:02:01] (03PS1) 10Volans: Cumin: include cumin::target only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/339198 (https://phabricator.wikimedia.org/T158773) [16:02:44] moritzm, paladox ^^^ [16:04:26] (03CR) 10Volans: "NOOP as expected on a production host: https://puppet-compiler.wmflabs.org/5537/" [puppet] - 10https://gerrit.wikimedia.org/r/339198 (https://phabricator.wikimedia.org/T158773) (owner: 10Volans) [16:04:28] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/339198 (https://phabricator.wikimedia.org/T158773) (owner: 10Volans) [16:04:59] (03PS2) 10Volans: Cumin: include cumin::target only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/339198 (https://phabricator.wikimedia.org/T158773) [16:06:09] (03CR) 10Volans: [C: 032] Cumin: include cumin::target only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/339198 (https://phabricator.wikimedia.org/T158773) (owner: 10Volans) [16:07:47] paladox: I'm not sure when/how labs puppetmaster is sync'ed but this should have fixed it ^^^ [16:07:52] please let me know if it doesn't [16:09:09] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:09:25] <_joe_> volans: ewww [16:09:39] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:09:40] <_joe_> volans: why is labs not going to have cumin? [16:10:09] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 5257 [16:10:13] _joe_: with which masters? 
[16:10:35] <_joe_> volans: I guess that's up for the labs team to decide [16:11:17] I'll ping them, of course, but for now [16:14:28] (03PS1) 10Marostegui: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339200 (https://phabricator.wikimedia.org/T158194) [16:15:02] !log cp4019 upgraded to varnish 4.1.5 [16:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:21] volans thanks. [16:16:25] i will check now [16:16:43] thx [16:17:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339200 (https://phabricator.wikimedia.org/T158194) (owner: 10Marostegui) [16:18:22] (03PS1) 10Tim Landscheidt: Update links to Tool Labs apt repository [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/339201 (https://phabricator.wikimedia.org/T158383) [16:19:07] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339200 (https://phabricator.wikimedia.org/T158194) (owner: 10Marostegui) [16:19:16] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339200 (https://phabricator.wikimedia.org/T158194) (owner: 10Marostegui) [16:19:22] volans still fails [16:19:52] !log cp3006 upgraded to varnish 4.1.5 [16:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:23] paladox: do you have you own puppetmaster or are you using labs's one? [16:20:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 - T158194 (duration: 00m 40s) [16:20:50] volans i have my own puppetmaster which i just did a git pull and then ran puppet agent on the phabricator instance. 
[16:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:55] T158194: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194 [16:21:42] paladox: if you check your root's crontab there is a script to do it [16:21:47] /usr/local/bin/git-sync-upstream [16:21:51] !log Shutdown db1060 for BBU replacement - T158194 [16:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:27] paladox: does it give you any error? do you have the last commit? [16:24:12] volans yep, already ran.root@puppet-phabricator:/var/lib/git/operations/puppet# /usr/local/bin/git-sync-upstream [16:24:12] --- /usr/local/bin/git-sync-upstream 2017-02-22T16:24:06 [16:24:12] Local diffs detected. Commit your changes! [16:24:28] if you have local diffs it cannot rebase [16:24:32] commit or stash them [16:24:37] oh, but i have none [16:24:40] i did git reset [16:24:47] and applied https://gerrit.wikimedia.org/r/#/c/338294/13 [16:25:11] volans when doing git reset it says [16:25:12] HEAD is now at 09fb016 Cumin: include cumin::target only in production realm [16:25:34] are you in a detached head state? [16:25:37] in a middle of a rebase? [16:25:45] there are conflicts? [16:25:59] volans no conflicts and no not in a middle of rebase [16:26:03] and no detached head. [16:26:05] (03PS14) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [16:26:59] volans ah, do i delete /etc/ferm/conf.d/10_ssh-from-cumin-masters [16:27:04] as will it be applied now? [16:27:18] moritzm: ferm::service does a rewrite of the whole /etc/ferm/conf.d directory or just add stuff? [16:27:36] it empties older files automatically [16:27:36] paladox: is what I was thinking, it might be that it doesn't remove it :( [16:27:49] Oh ok, so i will remove it :) [16:27:50] moritzm: nice, so it should work [16:27:51] (or non-puppetised files) [16:27:52] ack [16:28:01] strange... 
[16:29:01] volans it re added the file [16:29:02] !log reimage of relforge1001 starting [16:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:15] Notice: /Stage[main]/Base::Firewall/Ferm::Service[ssh-from-cumin-masters]/File[/etc/ferm/conf.d/10_ssh-from-cumin-masters]/ensure: created [16:29:30] paladox: damn, right I know why [16:29:35] my fault [16:29:37] oh [16:30:48] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3047007 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['relforge1001.eqiad.wmnet'] ``` The... [16:32:13] (03PS1) 10Volans: Cumin: enable ferm service only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/339206 (https://phabricator.wikimedia.org/T158773) [16:33:29] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.154 second response time [16:33:46] moritzm: cumin's fix for the ferm part ^^^ [16:34:09] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 1, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 259, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 8 [16:35:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [16:35:21] ^relforge is me, sorry for the noise... 
[16:35:53] ACKNOWLEDGEMENT - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 1, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 259, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_n [16:36:14] (03CR) 10Paladox: [C: 031] Cumin: enable ferm service only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/339206 (https://phabricator.wikimedia.org/T158773) (owner: 10Volans) [16:36:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339207 [16:37:08] (03CR) 10Marostegui: [C: 04-1] "Wait for the server to be back from maintenance and the lag to be gone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339207 (owner: 10Marostegui) [16:37:40] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:38:09] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:38:19] volans tested your new fix and it works, but it seems ferm does active (exited) [16:38:32] but i doint see any other error when i do sudo service ferm status [16:39:09] that means that last time it exited, if you restart again it goes away [16:39:17] if I understood what you're referring to [16:40:09] volans oh, doing a restarts and doing sudo service ferm status still says active (exited) [16:40:09] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:40:40] (03PS2) 10Giuseppe Lavagetto: Syncer: better logging [software/conftool] - 10https://gerrit.wikimedia.org/r/339194 [16:40:43] (03PS2) 10Giuseppe Lavagetto: Version bump [software/conftool] - 
10https://gerrit.wikimedia.org/r/339195 [16:41:02] paladox: ah that one, sorry, that's normal [16:41:07] i will just restart the instance to see if that fixes it. [16:41:23] (03CR) 10jerkins-bot: [V: 04-1] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/339195 (owner: 10Giuseppe Lavagetto) [16:41:25] (03CR) 10jerkins-bot: [V: 04-1] Syncer: better logging [software/conftool] - 10https://gerrit.wikimedia.org/r/339194 (owner: 10Giuseppe Lavagetto) [16:41:40] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/339206 (https://phabricator.wikimedia.org/T158773) (owner: 10Volans) [16:41:52] (03CR) 10Paladox: [C: 031] "Tested on the puppetmaster and it fixes it on labs :)" [puppet] - 10https://gerrit.wikimedia.org/r/339206 (https://phabricator.wikimedia.org/T158773) (owner: 10Volans) [16:42:10] paladox: it's the correct state [16:42:16] yep [16:42:23] (03CR) 10Volans: [C: 032] Cumin: enable ferm service only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/339206 (https://phabricator.wikimedia.org/T158773) (owner: 10Volans) [16:44:02] (03PS15) 10Paladox: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 [16:46:26] volans is it meant to say this [16:46:27] Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root.d]: Not removing directory; use 'force' to override [16:46:32] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3047091 (10fgiunchedi) Thanks @dcausse for the audit! `memcached-keys.log` should have been there after I switched `udplog` CNAME, I've roll-restarted rsyslog on mc hosts. I'll investigate apache2/... 
[16:46:58] (03PS3) 10Giuseppe Lavagetto: Syncer: better logging [software/conftool] - 10https://gerrit.wikimedia.org/r/339194 [16:47:00] (03PS3) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/339195 [16:47:15] paladox: is it empty? [16:47:19] yep [16:47:26] drwxr-xr-x 2 root root 4096 Feb 22 16:18 . [16:47:26] dr-xr-xr-x 3 root root 4096 Feb 22 14:56 .. [16:48:36] volans ^^ [16:48:57] that's a part I didn't thouched, is how the ssh::userkyes module works [16:49:06] apparently, I need to check [16:49:35] ok [16:52:18] volans i see https://github.com/wikimedia/puppet/commit/bf7430edc98418abaebe6d5d7ff9ed500acc65ae which dosent change anything to do with /etc/ssh/userkeys/ but it changes a param that has /etc/ssh/userkeys/ in it. [16:52:30] authorized_keys_file: /etc/ssh/userkeys/%u /etc/ssh/userkeys/%u.d/cumin /etc/ssh/userkeys/%u.d/labstore [16:52:57] yes, I know [16:53:56] and https://github.com/wikimedia/puppet/blob/bf7430edc98418abaebe6d5d7ff9ed500acc65ae/modules/ssh/manifests/server.pp#L25 [16:54:46] that's ok, is where an sshd server will look for authorized_keys [16:54:55] it doesn't have to exists [16:54:59] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3047097 (10Dzahn) >>! In T158656#3045110, @Paladox wrote: > There being replaced with private edits :) I don't anticipate ever needing a private edit. Really private th... 
[16:55:09] ok [16:55:09] but if it exists sshd will consider those authorized_keys too [16:55:12] ok [16:55:32] the thing I don't know is the not removal of the empty directory [16:55:43] but I cannot look at it right now, meeting in few mintues [16:56:08] (03PS1) 10Volans: Add support for 'not' in simple hosts selection [software/cumin] - 10https://gerrit.wikimedia.org/r/339213 (https://phabricator.wikimedia.org/T158748) [16:56:34] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3047103 (10Paladox) @Dzahn private edit's would normally be used for creating your first inline edit change i presume. https://gerrit-review.googlesource.com/#/c/98134/ [16:56:34] ok [16:59:21] (03CR) 10Giuseppe Lavagetto: [C: 032] Syncer: better logging [software/conftool] - 10https://gerrit.wikimedia.org/r/339194 (owner: 10Giuseppe Lavagetto) [17:00:09] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 7 failures [17:00:09] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 26 failures [17:00:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.147 second response time [17:01:17] rigel/bellatrix pupeptrun are ok, it's just a puppetmaster kernel rebot [17:01:30] err "puppetmaster reboot for kernel update" [17:01:36] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3047107 (10Dzahn) >>! In T158656#3047103, @Paladox wrote: > @Dzahn private edit's would normally be used for creating your first inline edit change i presume. That's ok,... [17:02:21] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3047110 (10Paladox) @Dzahn refresh, it sometimes fails to load. But you need javascript enabled to view the site. 
[17:04:27] 06Operations, 10Traffic: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3047116 (10BBlack) [17:04:30] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3047111 (10BBlack) 05Open>03Resolved a:03BBlack Singapore approved by Legal and selected, which is pretty much our ideal candidate on a range of issues. [17:05:09] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 26 failures [17:05:09] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 7 failures [17:09:45] 06Operations, 06Discovery, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10matmarex) [17:10:09] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [17:10:09] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:17:19] 06Operations, 06Discovery, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047159 (10Paladox) p:05Triage>03High [17:17:35] 06Operations, 06Discovery, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047161 (10Paladox) p:05High>03Unbreak! [17:18:18] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043892 (10CaitVirtue) Added Danny as a subscriber, as he's the event lead. 
[17:19:58] 06Operations, 06Discovery, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047176 (10Paladox) The only update i see is https://phabricator.wikimedia.org/rWPOR13e001a4611b9c75d1e8538b55d6a25bfa4b3bd0 [17:24:01] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3047213 (10bd808) >>! In T123728#3047091, @fgiunchedi wrote: > I'll investigate apache2/hhvm logs These are both handled by the `mediawiki::rsyslog` Puppet class. That class sets up local rsyslog ru... [17:24:26] 06Operations, 06Discovery, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047216 (10Paladox) I get this error in the developer tools [Error] ReferenceError: Can't find variable: doWhenReady (anonymous function) (gt-ie9-c84bf66d33.js... [17:25:49] RECOVERY - MegaRAID on db1049 is OK: OK: optimal, 1 logical, 2 physical [17:26:08] :-) [17:26:08] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3047217 (10Papaul) [17:26:29] that was fast [17:29:53] 06Operations, 06Discovery, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10ema) Note that the JS file mentioned above loads fine adding a query argument: https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-d1cc91a7... 
[17:30:21] 06Operations, 06Discovery, 10Traffic, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047244 (10ema) [17:36:45] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3047251 (10jcrespo) The user impact/dependencies are not 100% clear for this maintenance, which will be the long one (maybe a couple of days), so I requested help of some... [17:37:33] 06Operations, 06Discovery, 10Traffic, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047254 (10ema) And the redirect comes from mediawiki: ``` < HTTP/1.1 301 Moved Permanently < Date: Wed, 22 Feb 2017 17:33:13 GMT < Content-Type:... [17:43:36] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3047284 (10RobH) [17:44:04] 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3004739 (10GWicke) Are there any spares of this disk type at hand? [17:46:42] 06Operations, 06Discovery, 10Traffic, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047301 (10Paladox) Is this a mediawiki core change causing this problem? Should #mediawiki-general-or-unknown be added? 
[17:46:47] 06Operations, 06Discovery, 10Traffic, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10Dzahn) also see T128546 [17:48:31] (03CR) 10Mobrovac: [C: 031] systemd: update class name in a fail() error [puppet] - 10https://gerrit.wikimedia.org/r/339174 (owner: 10Hashar) [17:48:38] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3047316 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['relforge1001.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['relforge1001.e... [17:49:16] 06Operations, 06Discovery, 10Traffic, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10Dzahn) @Paladox Seems more likely it's related to the last deploy on T128546, but that's possible too. [17:49:27] 06Operations, 06Discovery, 10Traffic, 10Wikimedia-Portals: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10Dereckson) There is a .jsl10n CSS rule with `visibility: hidden`. I guess JS code should show it, but it's not the case. [17:49:44] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047322 (10debt) [17:50:05] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047323 (10Paladox) Oh, ok, i will add the project just in case. We can always remove it later if it turns on that it was something... 
[17:50:15] 06Operations, 06Discovery, 10MediaWiki-General-or-Unknown, 10Wikimedia-Portals, 03Discovery-Portal-Sprint: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047324 (10Paladox) [17:51:11] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [17:51:41] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:41] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:41] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:42] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:51] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:51] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:51] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:52] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:00] marostegui ^^ [17:52:01] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:01] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:01] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:01] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:02] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:11] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:52:11] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:12] don't worry [17:52:21] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:21] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:31] things are up, just a bit slow due to backups [17:52:31] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:31] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:31] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:53:01] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [17:53:01] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:53:01] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:53:01] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:53:01] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:53:02] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:53:49] 06Operations, 06Discovery, 10MediaWiki-General-or-Unknown, 10Wikimedia-Portals, 03Discovery-Portal-Sprint: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047332 (10Dereckson) Last change to have touched the JS code seems to be 78a8d57ff46e. 
[17:53:50] it was just a glitch by nagios queries piling up [17:54:21] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047334 (10Dereckson) [ Removed tag as not served by MediaWiki software, but by custom code in the portals repository ] [17:55:44] (03PS1) 10EddieGP: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) [17:55:53] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) (owner: 10EddieGP) [17:56:01] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:01] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:01] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:01] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:11] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:11] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:56:19] That is probably because of the backuops, I will check it now [17:56:21] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:56:21] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:56:21] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:56:31] yeah, there is a race condition [17:56:32] (03PS2) 10EddieGP: New throttle rule, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) [17:56:41] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) (owner: 10EddieGP) [17:57:36] tendril, prometheus, dumps and nagios at the same time, fighting with each other [17:58:11] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:58:32] (03PS3) 10EddieGP: New throttle rule, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) [17:58:41] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:41] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:51] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:51] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:51] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:51] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:59:14] let me downtime those [17:59:21] so we at least avoid the spam [17:59:21] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [17:59:21] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [17:59:31] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:31] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:31] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:01:08] (03PS1) 10Marostegui: db-eqiad.php: Repool db1060 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339226 (https://phabricator.wikimedia.org/T158194) [18:01:50] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1049 - https://phabricator.wikimedia.org/T158761#3047356 (10Marostegui) 05Open>03Resolved All good now! Thanks! ``` root@db1049:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL Device(Encl-32 Slot-4) is not in rebuild process Exit Code: 0x00 root@d... [18:02:01] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:02:01] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:02:24] (03PS1) 10Dereckson: Rollback www.wikipedia.org portal code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339227 (https://phabricator.wikimedia.org/T158782) [18:02:54] jouncebot: next [18:02:54] In 0 hour(s) and 57 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T1900) [18:03:29] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194#3047361 (10Marostegui) Thanks @Cmjohnson - the BBU now looks good! 
``` root@db1060:~# megacli -AdpBbuCmd -aAll BBU status for Adapter: 0 BatteryType: BBU Voltage: 3937 mV Current: 468... [18:03:41] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:03:41] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 92.04 seconds [18:03:41] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:03:41] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87322.04 seconds [18:03:41] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:03:42] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:03:42] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:03:43] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87406.36 seconds [18:03:43] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87514.67 seconds [18:03:44] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [18:03:45] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 84898.67 seconds [18:03:45] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194#3047362 (10Marostegui) 05Open>03Resolved [18:03:45] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:03:45] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86977.69 seconds [18:03:46] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:03:51] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: 
intentional) [18:03:51] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:03:51] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:03:51] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:03:52] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:04:20] 06Operations, 06Labs: labmon1001: Graphite + Mirantis' Django don't play well together - https://phabricator.wikimedia.org/T158789#3047363 (10faidon) [18:04:21] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:04:21] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:04:21] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:05:53] jynus: hi, I imagine you currently want to get full access to tin /srv/mediawiki-staging for MySQL pooling? [18:06:39] I do not want that [18:06:44] marostegui, jynus: maybe you could run those pigz with some nice/ionice to avoid this spam, not sure how much speed vs noise is more important here ;) [18:06:58] volans, normally that doesn't happen [18:07:28] To fix www.wikipedia.org no text issue, we could rollback the new JS code introduced at the last update, by rollbacking the portals repo to previous state. [18:07:31] in fact, io is not the problem [18:07:41] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:07:49] volans, there is a race condition on query show slave status [18:08:11] ok [18:08:12] the stop slave + show slave status blocks stugg [18:10:44] Dereckson, maybe you want to ping someone from the search team or maybe release engineering [18:11:27] Dereckson: That's my best suggestion [18:11:34] I pinged discovery re: your change [18:11:38] But I didn't get a response [18:11:40] it's already arrived in #wikimedia-discovery [18:11:41] it seems [18:13:53] (03CR) 10Chad: [C: 032] Rollback www.wikipedia.org portal code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339227 (https://phabricator.wikimedia.org/T158782) (owner: 10Dereckson) [18:15:16] (03Merged) 10jenkins-bot: Rollback www.wikipedia.org portal code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339227 (https://phabricator.wikimedia.org/T158782) (owner: 10Dereckson) [18:15:37] 339227 live on mwdebug1002 [18:15:55] and text appears [18:17:00] !log dereckson@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 40s) [18:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:40] !log dereckson@tin Synchronized portals: (no justification provided) (duration: 00m 39s) [18:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:50] !log Last two deployment entries were to rollback portals/ to last known state (T158782) [18:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:54] T158782: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782 [18:18:15] (03CR) 10jenkins-bot: Rollback www.wikipedia.org portal code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339227 (https://phabricator.wikimedia.org/T158782) (owner: 10Dereckson) [18:18:38] So it's not related to the last deployment, as it works on mwdebug1002 but not live [18:18:46] Dereckson was 
it deployed to prod? Or just to the test server? [18:19:01] I still see no text at https://www.wikipedia.org/ [18:19:17] worked on mwdebug1002 but not live, indeed [18:19:24] ok [18:19:24] but not in prod [18:19:31] (03CR) 10Dzahn: "have you tested the reboot successfully now?" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [18:20:22] (03PS1) 10Faidon Liambotis: Add an interface_primary fact [puppet] - 10https://gerrit.wikimedia.org/r/339231 [18:20:24] (03PS4) 10EddieGP: New throttle rule, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) [18:21:17] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047421 (10Dereckson) The rollback worked on mwdebug1002 (it works on mwdebug1001 too), but doesn't work in prod. [18:21:23] (03PS6) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [18:21:56] (03CR) 10Paladox: "> have you tested the reboot successfully now?" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [18:22:06] (03PS7) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [18:22:11] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:24:05] (03CR) 10Dzahn: [C: 04-1] "make "ssh port" a parameter of the class. avoid doing the hiera lookup inside the module. 
and inline comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338294 (owner: 10Paladox) [18:25:07] paladox: was a local (ours) browser issue: .js redirect to 404 was still in cache [18:25:23] oh [18:26:32] (03CR) 10Paladox: [C: 031] Phabricator: Make ssh-phab port configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338294 (owner: 10Paladox) [18:26:54] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047466 (10Dereckson) p:05Unbreak!>03Normal [18:27:09] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10Dereckson) p:05Normal>03High [18:27:11] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:27:26] (03CR) 10Dzahn: "you should put on this change that there is an existing systemd unit file in another location and that you want to replace it and why. And" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [18:28:23] 06Operations, 10Traffic, 07Mobile: Samsung Internet's desktop mode getting redirected to mobile site - https://phabricator.wikimedia.org/T158599#3047490 (10elukey) Ran some hive queries on webrequest data with @ema, our understanding is that the "Mobile" keyword in the UA should indicate that the TV is askin... [18:29:21] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10debt) Thanks for your quick help, @Dereckson ! 
[18:29:40] (03PS8) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [18:31:14] (03CR) 10Dzahn: [C: 04-1] "i don't think it's "ugly" or matters but ok. Then just do the part where the hiera lookup is not inside the module. Also i still think th" [puppet] - 10https://gerrit.wikimedia.org/r/338294 (owner: 10Paladox) [18:32:29] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047507 (10Paladox) Works now. [18:36:41] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:41:56] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10greg) This needs an incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation [18:42:12] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047530 (10greg) (And thanks, @Dereckson :) ) [18:43:07] can I deploy wmf-config/db-eqiad.php Dereckson if you are done? [18:43:11] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:43:58] marostegui: yes you can [18:44:03] thanks! 
[18:44:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1060 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339226 (https://phabricator.wikimedia.org/T158194) (owner: 10Marostegui) [18:46:03] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1060 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339226 (https://phabricator.wikimedia.org/T158194) (owner: 10Marostegui) [18:47:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 with less weight - T158194 (duration: 00m 39s) [18:47:19] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194#3047552 (10Marostegui) Repooled db1060 with less weight (and still not serving API again) so it can warm up a bit. [18:47:21] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1060 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339226 (https://phabricator.wikimedia.org/T158194) (owner: 10Marostegui) [18:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:25] T158194: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194 [18:47:48] (03CR) 10Yuvipanda: Update links to Tool Labs apt repository (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/339201 (https://phabricator.wikimedia.org/T158383) (owner: 10Tim Landscheidt) [18:48:35] 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3047577 (10RobH) There are not, however, it was discussed during the ops meeting today to order some. Task T158795 tracks the ordering of replacement disks. [18:51:11] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:56:08] (03CR) 10MarcoAurelio: "Scheduled for deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [18:58:18] (03PS2) 10Dzahn: systemd: update class name in a fail() error [puppet] - 10https://gerrit.wikimedia.org/r/339174 (owner: 10Hashar) [18:58:28] (03CR) 10MarcoAurelio: "Scheduled for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338751 (https://phabricator.wikimedia.org/T158516) (owner: 10MarcoAurelio) [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T1900). [19:00:05] matt_flaschen: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:11] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:00:45] hmm... didn't mentioned me... [19:01:11] I'm here though [19:01:52] Present [19:04:03] I can SWAT [19:05:16] LMK when swat is done, I think I'll need to poke rsyslog to move apache/hhvm logs to mwlog1001 too [19:06:34] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3047628 (10Pchelolo) I've done some testing on what's happening if the partition where we log fills up. I've created a VM with a 100 meg partition and started trace-logging there. Whe... [19:06:46] (03PS4) 10Thcipriani: Configuration changes for wikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338751 (https://phabricator.wikimedia.org/T158516) (owner: 10MarcoAurelio) [19:07:07] godog: will do [19:07:15] * tabbycat eyes thcipriani -- those wikitech patches can't be checked on mwdebug just fyi [19:08:13] tabbycat: yup, since wikitech is silver only. 
Thank you for the heads up :) [19:08:37] yep, whatever silver on the tech world means :) [19:09:25] tabbycat: it's a server [19:09:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338751 (https://phabricator.wikimedia.org/T158516) (owner: 10MarcoAurelio) [19:09:41] :D [19:09:54] "the silver server" [19:10:00] tabbycat: "silver" is the RFC 1178 machine name of this server, https://tools.ietf.org/html/rfc1178 [19:10:02] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3047643 (10elukey) After the changes done this morning, something *seems* to have changed, namely I don't see anymore `FetchError no... [19:10:59] (03Merged) 10jenkins-bot: Configuration changes for wikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338751 (https://phabricator.wikimedia.org/T158516) (owner: 10MarcoAurelio) [19:11:07] (03CR) 10jenkins-bot: Configuration changes for wikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338751 (https://phabricator.wikimedia.org/T158516) (owner: 10MarcoAurelio) [19:11:11] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:12:08] You have currently three popular and concurrent names to name machines: (1) pick arbitrary and unique names (RFC 1178) - e.g. chemical elements for Wikimedia servers like Silver, Tin or Terbium (2) use names, e.g. mwdebug1001 mwdebug1002 (3) uses random generated values or hashes, e.g. 
the default Docker containers' hostnames [19:12:31] (ways to name) [19:12:55] (1) has a lot of advantages for small installations, but it's not scalable [19:13:01] gold.equiad.wmnet [19:13:04] :) [19:13:37] " Add gold and platinum MAC to dhcp [19:13:42] already used in the past :p [19:14:23] what was the trick to find and replace the selected lines with Sublime text editor? [19:14:24] (03PS5) 10EddieGP: New throttle rule, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) [19:14:36] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:338751|Configuration changes for wikitech.wikimedia.org]] T158516 T158554 T158482 (duration: 00m 40s) [19:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:43] T158554: Allow 'contentadmin' being managed by wikitech bureaucrats - https://phabricator.wikimedia.org/T158554 [19:14:43] T158482: Remove shellmanagers group on wikitech - https://phabricator.wikimedia.org/T158482 [19:14:44] T158516: Decide which group(s) will be able to manage 'shell' on wikitech - https://phabricator.wikimedia.org/T158516 [19:14:46] ^ tabbycat sync'd check pleas [19:14:48] e [19:14:51] on it [19:15:33] bblack AaronSchulz SMalyshev Looking for some advice on purging URLs for CentralNotice banner loading (again)... Hoping for some thoughts on both the Varnish and PHP side of things... We have to purge URLs on a single page, but with a lot of permutations of URL params [19:15:41] godog: greenlantern vs. silversurfer [19:15:45] thcipriani: looks good to me [19:15:57] Dunno if there's a good way to do so with a regex or something? I think that's not available from PHP? [19:16:01] PROBLEM - HP RAID on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[19:16:20] We can purge one-by-one, it would purge about 3000 URLs every time a banner is saved [19:16:27] maybe bd808 wants to remove the shellmanagers now via UserRights [19:16:35] so it gets logged [19:16:41] or we can do that later via SQL [19:16:44] mutante: hehe we'd have to find some green element [19:16:45] And that's ignoring the possible issue of differently-ordered URL params (probably not prevalent but might well happen sometimes) [19:16:48] AndyRussG, I know one way is TitleSquidURLs. [19:16:48] if we don't care about logging [19:16:49] ejegg: awight: ^ [19:17:06] matt_flaschen: That's a hook for adding URLs, no? [19:17:27] tabbycat: what do you need me to do? [19:17:32] AndyRussG, yeah, https://www.mediawiki.org/wiki/Manual:Hooks/TitleSquidURLs , but it only works if there is a single Title with multiple URLs. [19:17:49] bd808: I'm removing the shellmanagers group from wikitech as you proposed [19:18:03] AndyRussG, not sure if banners have a Title representation. [19:18:07] matt_flaschen: it is a single title... Hmmm, I only looked at the MobileFrontEnd hook handler... [19:18:10] bd808: and I wonder if we could remove those users from the group before removing the group from the wiki [19:18:22] I've already granted 'crats the ability to manage 'shell' [19:18:35] tabbycat: ah. so you want me to -shellmanagers from the existing users? I can do that [19:18:45] bd808: if that's okay? [19:18:50] tabbycat: could you manually rebase this: https://gerrit.wikimedia.org/r/#/c/338632/ ? gerrit is complaining. [19:18:54] Mmm yes that's also the hard part. They do have a Title base, but what varies per banner, for loading, are the params [19:18:58] thcipriani: looking [19:19:18] thcipriani: okay, will try [19:19:22] AndyRussG, is it all a single special page, with just different parameters? 
[19:19:53] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3047685 (10EWilfong_WMF) @Jgreen @RobH - Checking in on our next steps here. I would love to get the new cert in place this week so w... [19:19:58] It's all Special:BannerLoader on meta, with banner, campaign, uselang and debug params [19:20:11] matt_flaschen: https://gerrit.wikimedia.org/r/#/c/336237/ is what we have so far.... [19:21:20] AndyRussG, that's basically what I was about to suggest. [19:22:04] thcipriani: please wait for bd808 to finish removing the users [19:22:36] (03PS1) 10Andrew Bogott: Move archive-instances to python2, labmon1001 to liberty [puppet] - 10https://gerrit.wikimedia.org/r/339235 (https://phabricator.wikimedia.org/T158789) [19:22:39] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3047689 (10RobH) @EWilfong_WMF: Just wanted to check, we'll be generating the private key, csr, and ordering the certificate. In t... [19:22:41] tabbycat: ok [19:23:31] matt_flaschen: ah K... :) Yeah the lingering concerns with that one are the possible different ordering of URL params coming from browsers (constructed via a JS object), and whether it's OK to flood Varnish with so many purge (or ban?) requests, and also that a regex would make stuff much simpler [19:23:50] AndyRussG, BTW, what do you mean "one-by-one"? It seems there is multi support that feeds into. 
: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/deferred/CdnCacheUpdate.php;0366498177584befce06fc6cc4b1913dea904e5e$116 [19:24:02] dbstore2001 is me [19:24:04] matt_flaschen: your change is live on mwdebug1002, check please [19:24:22] it just has too much io [19:24:43] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3047695 (10EWilfong_WMF) Perfect. ewilfong@trilogyinteractive.com [19:25:05] jynus: 1001 or 2001? [19:25:08] (03PS7) 10MarcoAurelio: Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) [19:25:56] thcipriani, tabbycat: shellmanagers is empty now. :) [19:26:07] 2001 [19:26:19] bd808: nice thanks :) [19:26:28] bd808: thanks thcipriani patch rebased waiting for zuul [19:26:32] I sent you an email [19:26:58] gotcha [19:26:59] thanks :) [19:27:01] * tabbycat just wants to say that it's nice to work with you [19:27:08] tabbycat: ok, will get out the flow change, and then that patch [19:27:16] sure [19:27:32] marostegui, I would literally downtime raids and go to have some rest [19:27:49] matt_flaschen: ah yes, we do send all the URLs in a single job. But if we could do some regex magic, we wouldn't have to iterate about like three's no tomorrow to get all the permutations of URL param values [19:27:50] raids? 
[19:28:11] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 22, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 259, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 91.3333333333, active_shards: 27 [19:28:11] the raid check will complain because it timeouts [19:28:16] aah [19:28:32] that is what we have to suffer from buying HDs [19:28:35] Let's see if those backups finish already and we can stop the db :( [19:28:36] instead of SSDs [19:28:49] one last thing [19:29:01] when bringing down dbstore, if you do that [19:29:11] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:29:12] remember to stop all slaves, disable events, stops slaves again [19:29:14] (03PS1) 10Chad: Group1 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339236 [19:29:17] and wait for some time [19:29:39] yeah, let's see if it finishes before I go to bed :| [19:29:43] it is nasty that slaves can come back online and make data corruption [19:29:46] tabbycat: I can actually probably get this out in short order. I'll just do that while the flow changes are being checked. 
[19:29:56] (03CR) 10Chad: [C: 04-2] "For later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339236 (owner: 10Chad) [19:30:06] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:30:08] (03PS2) 10Andrew Bogott: Move archive-instances to python2, labmon1001 to liberty [puppet] - 10https://gerrit.wikimedia.org/r/339235 (https://phabricator.wikimedia.org/T158789) [19:30:10] (03PS1) 10Andrew Bogott: Mirantis repo: Make the apt proxy version-sensitive [puppet] - 10https://gerrit.wikimedia.org/r/339237 [19:30:27] jynus: you have the disable events command handy? [19:30:57] thcipriani, mwscript apparently doesn't exist on mwdebug1002, so I can't test. I think a similar issue came up before. There are no changes to files hit in web requests, except a comment change. [19:31:13] I'll test it later today (running the export, then importing first to test2wiki) [19:31:19] thcipriani, sorry, I mean Memcached doesn't exist. [19:31:22] (03Merged) 10jenkins-bot: Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:31:31] (03CR) 10jenkins-bot: Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:31:43] marostegui, set global event_Scheduler = off should work, but last time we had issues [19:31:50] thcipriani, https://phabricator.wikimedia.org/P4974 [19:32:00] maybe it takes some time to finish current ones or something [19:32:01] yes, thanks that is what I meant, if you had do to something else apart from that [19:32:07] matt_flaschen: ah, right. This is the php5 vs hhvm memcached thing. OK, I assume this is fine to sync-dir then? [19:32:20] thcipriani, yeah, it should not affect web requests in any way. 
[19:32:22] so just make sure twice they do not revive [19:32:23] well. "memcached" thing. several things. [19:32:25] ok, will do. [19:34:34] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:338632|Removing the "shellmanagers" group from Wikitech]] T158482 (duration: 00m 49s) [19:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:39] T158482: Remove shellmanagers group on wikitech - https://phabricator.wikimedia.org/T158482 [19:34:57] ^ tabbycat sync'd [19:35:39] thcipriani: argh, but forgot to remove the group from Add/Remove groups -- I can quickpatch it if you let me [19:36:01] tabbycat: sure [19:37:48] !log thcipriani@tin Synchronized php-1.29.0-wmf.13/extensions/Flow: SWAT: [[gerrit:339116|Import dump: support importing a board that exist in the farm]] T154830 (duration: 00m 56s) [19:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:53] T154830: Transwiki (within a farm) support for Flow dumps/imports - https://phabricator.wikimedia.org/T154830 [19:37:53] ^ matt_flaschen live everywhere [19:38:05] (03Draft2) 10MarcoAurelio: Follow-up I5edebdff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339238 [19:38:10] (03Draft1) 10MarcoAurelio: Follow-up I5edebdff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339238 [19:38:19] thcipriani: ^ [19:38:22] Thanks, thcipriani, we'll use it later today. [19:39:31] (03PS3) 10MarcoAurelio: Follow-up I5edebdff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339238 (https://phabricator.wikimedia.org/T158482) [19:39:36] (03PS2) 10Andrew Bogott: Mirantis repo: Make the apt proxy version-sensitive [puppet] - 10https://gerrit.wikimedia.org/r/339237 [19:39:38] (03PS3) 10Andrew Bogott: Move archive-instances to python2, labmon1001 to liberty [puppet] - 10https://gerrit.wikimedia.org/r/339235 (https://phabricator.wikimedia.org/T158789) [19:40:35] tabbycat: cool, ready to merge? 
[19:40:46] thcipriani: waiting for zuul [19:41:11] (03PS2) 10Gehel: Bump timeout to 1 minute [puppet] - 10https://gerrit.wikimedia.org/r/338473 (https://phabricator.wikimedia.org/T158184) (owner: 10Smalyshev) [19:41:21] bd808: now that shellmanagers is gone, maybe we can consider promoting Tim to 'crat so he can continue managing shell? [19:41:25] 06Operations: Ferm: leftovers on hosts were it was enabled and then removed - https://phabricator.wikimedia.org/T158798#3047720 (10Volans) [19:41:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339238 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:42:27] (03PS5) 10Aaron Schulz: Include DB shard in production SPI log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331808 [19:42:48] (03Merged) 10jenkins-bot: Follow-up I5edebdff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339238 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:43:00] (03CR) 10jenkins-bot: Follow-up I5edebdff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339238 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:43:12] (03CR) 10Gehel: [C: 032] Bump timeout to 1 minute [puppet] - 10https://gerrit.wikimedia.org/r/338473 (https://phabricator.wikimedia.org/T158184) (owner: 10Smalyshev) [19:44:00] (03CR) 10Dzahn: [C: 032] systemd: update class name in a fail() error [puppet] - 10https://gerrit.wikimedia.org/r/339174 (owner: 10Hashar) [19:44:02] tabbycat: seems like a good thing to me. scfc is super helpful in Labs [19:44:08] (03PS3) 10Dzahn: systemd: update class name in a fail() error [puppet] - 10https://gerrit.wikimedia.org/r/339174 (owner: 10Hashar) [19:45:44] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:339238|Finish removing "shellmanagers" on Wikitech]] T158482 (duration: 00m 40s) [19:45:49] ^ tabbycat sync'd! 
[19:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:50] T158482: Remove shellmanagers group on wikitech - https://phabricator.wikimedia.org/T158482 [19:45:55] godog: swat should be complete [19:45:59] thcipriani: shellmanagers is gone! [19:46:02] thanks! [19:46:20] thcipriani: fantastic, thanks [19:46:52] tabbycat: great! yw :) [19:46:58] !log roll-HUP rsyslog on mw1* to pick up DNS udplog change - T123728 [19:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:02] T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728 [19:47:13] (03PS3) 10Andrew Bogott: Mirantis repo: Make the apt proxy version-sensitive [puppet] - 10https://gerrit.wikimedia.org/r/339237 [19:47:14] (03PS4) 10Andrew Bogott: Move archive-instances to python2, labmon1001 to liberty [puppet] - 10https://gerrit.wikimedia.org/r/339235 (https://phabricator.wikimedia.org/T158789) [19:47:15] let's see if that does it [19:47:57] 06Operations, 10Phabricator, 10Phabricator (Upstream), 07Upstream: PHD ensuring umask goodness - https://phabricator.wikimedia.org/T91648#3047749 (10Aklapper) 05stalled>03Resolved Two years later, I'll just assume this has been fixed by the upstream patch 10 months ago. If not, feel free to reopen. 
[19:48:41] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [19:48:42] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 620 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3912823 keys, up 114 days 11 hours - replication_delay is 620 [19:49:31] (03CR) 10Andrew Bogott: [C: 032] Mirantis repo: Make the apt proxy version-sensitive [puppet] - 10https://gerrit.wikimedia.org/r/339237 (owner: 10Andrew Bogott) [19:49:40] (03PS4) 10Andrew Bogott: Mirantis repo: Make the apt proxy version-sensitive [puppet] - 10https://gerrit.wikimedia.org/r/339237 [19:49:41] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3910316 keys, up 114 days 11 hours - replication_delay is 0 [19:50:10] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3047754 (10Aklapper) [19:50:39] bd808: I've updated that patch for wikimediamessages removing Wikimedia Labs -> Wikitech [19:50:42] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3910248 keys, up 114 days 11 hours - replication_delay is 0 [19:50:53] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#1559747 (10Aklapper) This has been stuck now for 18 months. Does someone else have the same permissions in "Webmaster Tools"? [19:51:11] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:51:21] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [19:52:11] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [19:52:11] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set [19:52:48] (03PS5) 10Andrew Bogott: Move archive-instances to python2, labmon1001 to liberty [puppet] - 10https://gerrit.wikimedia.org/r/339235 (https://phabricator.wikimedia.org/T158789) [19:54:47] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3047787 (10Dzahn) >>! In T109810#3047754, @Aklapper wrote: > This has been stuck now for 18 months. Does someone else have the same permissions in "Webmaster... [19:55:07] (03CR) 10Andrew Bogott: [C: 032] Move archive-instances to python2, labmon1001 to liberty [puppet] - 10https://gerrit.wikimedia.org/r/339235 (https://phabricator.wikimedia.org/T158789) (owner: 10Andrew Bogott) [19:55:26] (03PS4) 10Dzahn: systemd: update class name in a fail() error [puppet] - 10https://gerrit.wikimedia.org/r/339174 (owner: 10Hashar) [19:56:43] !log deploying latest wdqs version [19:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:39] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3047812 (10RobH) a:05Jgreen>03RobH I'll handle key/csr generation and cert ordering. Then I'll pgp encrypt and email the key ove... 
[19:59:59] !log gehel@tin Started deploy [wdqs/wdqs@7768422]: (no justification provided) [20:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2000). [20:00:28] (03CR) 10Chad: [C: 032] Group1 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339236 (owner: 10Chad) [20:00:43] (03CR) 10Tim Landscheidt: Update links to Tool Labs apt repository (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/339201 (https://phabricator.wikimedia.org/T158383) (owner: 10Tim Landscheidt) [20:01:12] MediaWiki-🚆 [20:01:35] (03Merged) 10jenkins-bot: Group1 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339236 (owner: 10Chad) [20:01:44] (03CR) 10jenkins-bot: Group1 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339236 (owner: 10Chad) [20:02:03] !log gehel@tin Finished deploy [wdqs/wdqs@7768422]: (no justification provided) (duration: 02m 04s) [20:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:23] SMalyshev: ^ deployment done, tests still looking good [20:03:18] gehel: great [20:03:43] gehel: we may also want to do https://gerrit.wikimedia.org/r/#/c/243883/ maybe [20:03:57] (03CR) 10Smalyshev: [DNM] [WIP] Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev) [20:04:24] (03PS5) 10Smalyshev: Allow SPARQL endpoint to be queried via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) [20:04:25] !log demon@tin Started scap: group1 to wmf.13 [20:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:58] (03PS6) 10Smalyshev: Allow SPARQL endpoint to be queried via POST [puppet] - 
10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) [20:05:00] (03PS7) 10Gehel: Allow SPARQL endpoint to be queried via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev) [20:05:14] SMalyshev: sounds like a plan [20:06:50] (03CR) 10Gehel: [C: 032] Allow SPARQL endpoint to be queried via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev) [20:07:21] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:11:23] (03PS6) 10DCausse: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) [20:13:04] (03CR) 10jerkins-bot: [V: 04-1] Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [20:13:20] 06Operations: Graphite-web version in our repo cannot be installed due to missing dependencies - https://phabricator.wikimedia.org/T158802#3047847 (10Andrew) [20:14:11] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [20:14:58] (03PS7) 10DCausse: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) [20:15:51] RECOVERY - HP RAID on dbstore2001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [20:16:12] (03CR) 10jerkins-bot: [V: 04-1] Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [20:17:19] (03PS8) 10DCausse: Enable Translation memories 
multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) [20:19:14] (03PS1) 10Aaron Schulz: Enable $wgEnableWANCacheReaper for testwiki and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339245 [20:22:26] jouncebot: next [20:22:27] In 0 hour(s) and 37 minute(s): WDQS Deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2100) [20:22:27] In 0 hour(s) and 37 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2100) [20:23:30] (03PS1) 10Andrew Bogott: archive-instances: Use python-2 compatible os.makedirs [puppet] - 10https://gerrit.wikimedia.org/r/339246 (https://phabricator.wikimedia.org/T158789) [20:24:00] jouncebot: now [20:24:00] For the next 1 hour(s) and 35 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2000) [20:24:08] choo choo [20:26:16] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3047910 (10Gilles) You might not have access to the file contents when you generate a URL. What I'm saying is that you can't just drop that information... [20:26:20] (03PS2) 10Tim Landscheidt: Update links to Tool Labs apt repository [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/339201 (https://phabricator.wikimedia.org/T158383) [20:26:25] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3047913 (10mobrovac) @Pchelolo awesome! Thank you for testing. Could you also test the scenario where logs are sent to syslog via udp and then fill the partition again? So that we can... 
[20:27:38] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3047915 (10Gilles) Ah nevermind, it's only the md5 of some form of the filename, not the file contents? [20:28:04] (03PS2) 10Andrew Bogott: archive-instances: Use python-2 compatible os.makedirs [puppet] - 10https://gerrit.wikimedia.org/r/339246 (https://phabricator.wikimedia.org/T158789) [20:28:47] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3047919 (10Pchelolo) > @Pchelolo awesome! Thank you for testing. Could you also test the scenario where logs are sent to syslog via udp and then fill the partition again? So that we c... [20:30:05] !log demon@tin Finished scap: group1 to wmf.13 (duration: 25m 39s) [20:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:24] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3047940 (10Papaul) [20:32:26] 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3047938 (10Papaul) 05Open>03Resolved License and firmware update complete. I switch back the IDRAC to Dedicated [20:32:41] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3047941 (10mobrovac) >>! In T112648#3047919, @Pchelolo wrote: > Anyway, I'm preparing a solution for file logging issue, I think that should be fixed anyway, \o/ This is splendid! T... 
[20:36:21] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:36:47] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3047949 (10dr0ptp4kt) I just wanted to acknowledge I just got a ping on //this// Phabricator task. I won't be able to investigate until next week, unfortunat... [20:37:06] (03PS3) 10Andrew Bogott: archive-instances: Use python-2 compatible os.makedirs [puppet] - 10https://gerrit.wikimedia.org/r/339246 (https://phabricator.wikimedia.org/T158789) [20:39:22] (03CR) 10Andrew Bogott: [C: 032] archive-instances: Use python-2 compatible os.makedirs [puppet] - 10https://gerrit.wikimedia.org/r/339246 (https://phabricator.wikimedia.org/T158789) (owner: 10Andrew Bogott) [20:42:49] 06Operations, 06Labs, 13Patch-For-Review: labmon1001: Graphite + Mirantis' Django don't play well together - https://phabricator.wikimedia.org/T158789#3047985 (10Andrew) 05Open>03Resolved a:03Andrew I fixed this by moving the archive-instance script to python2, which let us use Liberty OpenStack packag... [20:47:35] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (10Krinkle) There seem to be a few issues that affect . * **Coal**: Requests fail with HTTP 500 (e.g. PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [20:56:31] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:56:41] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3909823 keys, up 114 days 12 hours - replication_delay is 0 [21:00:04] SMalyshev and gehel: Dear anthropoid, the time has come. 
Please deploy WDQS Deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2100). [21:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2100). Please do the needful. [21:00:52] Wdqs deployment already done [21:31:18] (03CR) 10Dzahn: [C: 04-1] "we made this work in labs by adding a second private IP on eth0 and making the phab ssh listen on 22 on that second IP. hence no special c" [puppet] - 10https://gerrit.wikimedia.org/r/338294 (owner: 10Paladox) [21:31:28] (03Abandoned) 10Dzahn: Phabricator: Make ssh-phab port configurable [puppet] - 10https://gerrit.wikimedia.org/r/338294 (owner: 10Paladox) [21:34:36] (03CR) 10Dzahn: "yep, that's better. it now explains why you are moving to a template. but is there a reason to change the contents of the unit file too?" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [21:37:17] (03PS9) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [21:37:30] (03PS10) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [21:42:27] !log maxsem@tin Started deploy [kartotherian/deploy@81db48c]: Deploying https://gerrit.wikimedia.org/r/#/c/339093/ [21:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:12] 06Operations, 10Phabricator, 06Release-Engineering-Team: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434#3048219 (10Dzahn) puppet run LOOKS like it works, but ssh-phab service not started for real: ``` @phab2001:~# puppet age... 
[21:44:31] jouncebot: now [21:44:32] For the next 0 hour(s) and 15 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2100) [21:44:32] For the next 0 hour(s) and 15 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2000) [21:44:41] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Set up backend per-IP limits on varnish for WDQS - https://phabricator.wikimedia.org/T119917#3048221 (10Smalyshev) 05Open>03Resolved a:03Smalyshev I think this is done. [21:45:13] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Response times of Wikidata Query Service increasing - https://phabricator.wikimedia.org/T147130#3048228 (10Smalyshev) @Gehel anything left to do here? [21:47:19] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3048229 (10matmarex) For reference, this was originally reported here, before I made this Phab task: https://e... [21:47:38] 06Operations, 10Phabricator, 06Release-Engineering-Team: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434#3048230 (10Dzahn) 13:46 Process: 34620 ExecStart=/usr/sbin/sshd -D -f<%= @sshd_config %> (code=exited, status... [21:50:27] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, 13Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3048266 (10matmarex) @Jdrewniak posted an incident report: * https://lists.wikimedia.org/pipermail/wikitech-l/... 
[21:54:01] (03PS1) 10Dzahn: phabricator: fix missing space in sshd-phab unit file [puppet] - 10https://gerrit.wikimedia.org/r/339312 (https://phabricator.wikimedia.org/T158434) [21:54:29] (03CR) 10jerkins-bot: [V: 04-1] phabricator: fix missing space in sshd-phab unit file [puppet] - 10https://gerrit.wikimedia.org/r/339312 (https://phabricator.wikimedia.org/T158434) (owner: 10Dzahn) [21:54:36] are you kidding jenkins [21:55:25] (03PS2) 10Dzahn: phabricator: fix missing space in sshd-phab unit file [puppet] - 10https://gerrit.wikimedia.org/r/339312 (https://phabricator.wikimedia.org/T158434) [21:55:50] (03CR) 10jerkins-bot: [V: 04-1] phabricator: fix missing space in sshd-phab unit file [puppet] - 10https://gerrit.wikimedia.org/r/339312 (https://phabricator.wikimedia.org/T158434) (owner: 10Dzahn) [21:57:06] (03Draft1) 10Paladox: phd service config needs to be a template [puppet] - 10https://gerrit.wikimedia.org/r/339311 [21:57:11] (03PS2) 10Paladox: phd service config needs to be a template [puppet] - 10https://gerrit.wikimedia.org/r/339311 [21:57:32] !log maxsem@tin Finished deploy [kartotherian/deploy@81db48c]: Deploying https://gerrit.wikimedia.org/r/#/c/339093/ (duration: 15m 05s) [21:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:22] (03CR) 10Paladox: "the file was moved to modules/phabricator/templates/sshd-phab.service.erb" [puppet] - 10https://gerrit.wikimedia.org/r/339312 (https://phabricator.wikimedia.org/T158434) (owner: 10Dzahn) [21:58:35] (03PS3) 10Dzahn: phabricator: fix missing space in sshd-phab unit file [puppet] - 10https://gerrit.wikimedia.org/r/339312 (https://phabricator.wikimedia.org/T158434) [21:58:55] (03CR) 10Paladox: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/339312 (https://phabricator.wikimedia.org/T158434) (owner: 10Dzahn) [21:59:11] legoktm: do you think you can review this sooner https://gerrit.wikimedia.org/r/#/c/338701/ so I can get it pushed through SWAT? 
[21:59:46] (03CR) 10Dzahn: [C: 04-1] "you are already doing this in https://gerrit.wikimedia.org/r/#/c/338022/ except one time you call it "phd.service.erb" and the other time" [puppet] - 10https://gerrit.wikimedia.org/r/339311 (owner: 10Paladox) [22:00:04] matt_flaschen: Respected human, time to deploy Flow script (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170222T2200). Please do the needful. [22:00:09] (03CR) 10Dzahn: [C: 04-1] "i was talking about "ssh-phab" service" [puppet] - 10https://gerrit.wikimedia.org/r/339311 (owner: 10Paladox) [22:00:37] (03CR) 10Paladox: "> you are already doing this in https://gerrit.wikimedia.org/r/#/c/338022/" [puppet] - 10https://gerrit.wikimedia.org/r/339311 (owner: 10Paladox) [22:01:07] (03CR) 10Paladox: "> i was talking about "ssh-phab" service" [puppet] - 10https://gerrit.wikimedia.org/r/339311 (owner: 10Paladox) [22:01:10] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3048319 (10RobH) [22:01:12] 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3048315 (10RobH) 05Resolved>03Open a:05Papaul>03fgiunchedi Re-opening, because this was just part of the issues on this host. Now that the drac is back online, the host still fails the followin... 
[22:01:18] (03CR) 10Dzahn: [C: 04-1] "hence my comment right above saying " except one time you call it "phd.service.erb" and the other time "phd.systemd.erb""" [puppet] - 10https://gerrit.wikimedia.org/r/339311 (owner: 10Paladox) [22:02:17] (03Abandoned) 10Paladox: phd service config needs to be a template [puppet] - 10https://gerrit.wikimedia.org/r/339311 (owner: 10Paladox) [22:03:07] (03CR) 10Dzahn: "13:54 you have 2 services" [puppet] - 10https://gerrit.wikimedia.org/r/339311 (owner: 10Paladox) [22:03:33] (03PS11) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [22:04:05] Here [22:04:33] 06Operations, 10RESTBase, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3048336 (10Pchelolo) So, I've tested the syslog-over-udp solution too, and as we've expected, the service doesn't even notice the filled-up partition. Also, create a PR that's fixing... [22:05:41] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3048354 (10mobrovac) [22:06:23] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [22:06:38] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3048370 (10mobrovac) @EBjune , @K4-713 as @MaxSem's manager(s), we need your approval for the request to be granted. 
[22:07:25] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3048372 (10mobrovac) [22:08:34] (03CR) 10Dzahn: [C: 032] phabricator: fix missing space in sshd-phab unit file [puppet] - 10https://gerrit.wikimedia.org/r/339312 (https://phabricator.wikimedia.org/T158434) (owner: 10Dzahn) [22:10:26] (03PS12) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [22:11:11] (03PS3) 10Krinkle: redis: declare /var/run/redis [puppet] - 10https://gerrit.wikimedia.org/r/268598 (owner: 10Ori.livneh) [22:13:58] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3048399 (10RobH) [22:16:29] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [22:17:21] (03PS5) 10Dzahn: systemd: update class name in a fail() error [puppet] - 10https://gerrit.wikimedia.org/r/339174 (owner: 10Hashar) [22:18:55] paladox: maybe you should run the tests locally before sending patch to gerrit ? :] [22:19:13] (03PS13) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [22:20:41] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [22:26:04] 06Operations, 10RESTBase, 10service-runner, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3048432 (10mobrovac) I think that with service-runner v2.2.3 we are solid to go for normal file logging. 
[22:27:25] (03PS14) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [22:27:56] (03PS15) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [22:30:01] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:01] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:32:40] ^ i just ran it [22:33:31] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3048443 (10mobrovac) [22:34:32] 06Operations, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434#3048444 (10Dzahn) works now on phab2001 > > > Notice: /Stage[main]/Phabricator::Vcs/Service[ssh-... 
[22:36:38] (03CR) 10Dzahn: [C: 04-1] "Error: Could not find template 'phabricator/initscripts/phd.service.erb' at /mnt/jenkins-workspace/puppet-compiler/5544/change/src/modules" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [22:37:08] (03PS16) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [22:38:48] (03PS17) 10Paladox: Phabricator: Add systemd template [puppet] - 10https://gerrit.wikimedia.org/r/338022 [22:42:50] (03CR) 10Dduvall: [C: 031] rubocop disable Style/PercentLiteralDelimiters [puppet] - 10https://gerrit.wikimedia.org/r/339175 (owner: 10Hashar) [22:44:03] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5546/" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox) [22:44:11] mutante thanks :) [22:45:08] (03PS6) 10EddieGP: New throttle rule, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) [22:46:00] paladox: thanks too, now you can follow-up with the change to use base::service [22:46:12] Yep, will do that tomorrow.
[22:46:18] yep [22:46:36] !log update RESTBase to 3340714f0: staging [22:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:46] (03PS1) 10Tim Landscheidt: graphite: Unquote octal number in parameter default [puppet] - 10https://gerrit.wikimedia.org/r/339321 [22:48:35] (03CR) 10Dzahn: [C: 032] rubocop disable Style/PercentLiteralDelimiters [puppet] - 10https://gerrit.wikimedia.org/r/339175 (owner: 10Hashar) [22:48:50] (03PS2) 10Dzahn: rubocop disable Style/PercentLiteralDelimiters [puppet] - 10https://gerrit.wikimedia.org/r/339175 (owner: 10Hashar) [22:50:52] !log update RESTBase to 3340714f0: canary on restbase1007 [22:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:17] !log stopping dbstore1001 mariadb in preparation for tomorrow's reimage T153768 [22:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:23] T153768: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768 [22:53:15] !log update RESTBase to 3340714f0 [22:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:06] jouncebot: next [22:56:07] In 1 hour(s) and 3 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T0000) [23:00:01] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:00:37] (03Abandoned) 10Dzahn: ssl: delete ecc-uni.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/334209 (owner: 10Dzahn) [23:00:41] (03Abandoned) 10Dzahn: ssl: delete uni.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/334210 (owner: 10Dzahn) [23:03:31] (03CR) 10Dzahn: "could (tool)-labs people please review this" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [23:05:24] (03CR) 10Paladox: [C: 031] Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [23:07:01] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 61283 MB (12% inode=99%) [23:07:14] (03CR) 10Paladox: Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [23:09:32] (03PS2) 10Andrew Bogott: graphite: Unquote octal number in parameter default [puppet] - 10https://gerrit.wikimedia.org/r/339321 (owner: 10Tim Landscheidt) [23:09:40] (03CR) 10Andrew Bogott: [C: 032] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/339321 (owner: 10Tim Landscheidt) [23:12:01] (03CR) 10Dzahn: [C: 031] "sounds good, but not going to merge changes in kafka submodule." 
[puppet/kafka] - 10https://gerrit.wikimedia.org/r/338385 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [23:12:38] (03CR) 10Dzahn: [C: 031] "sounds good, but i don't have any relation to the wikimetrics submodule" [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/338387 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [23:14:29] (03CR) 10Dzahn: "i'll leave this to the editors of check_graphite" [puppet] - 10https://gerrit.wikimedia.org/r/338095 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [23:15:43] (03CR) 10Dzahn: [C: 031] "add me again after a maintenance window has been picked" [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:16:09] (03CR) 10Dzahn: "abstain" [puppet] - 10https://gerrit.wikimedia.org/r/331602 (owner: 10Alexandros Kosiaris) [23:16:54] Flow script window is complete [23:17:07] !log Migrated https://meta.wikimedia.org/wiki/Research_talk:ORES_paper to https://www.mediawiki.org/wiki/Talk:ORES/Paper using extensions/Flow/maintenance/dumpBackup.php and importDump.php [23:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:50] !log Exported https://meta.wikimedia.org/wiki/Talk:Flow/Developer_test_page to https://meta.wikimedia.org/wiki/Talk:Flow/Developer_test_page/Wikitext using extensions/Flow/maintenance/convertToText.php [23:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:21] (03CR) 10Dzahn: [C: 04-1] "feel free re-add me after it's not WIP anymore" [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4) [23:20:01] RECOVERY - Disk space on elastic1024 is OK: DISK OK [23:21:15] (03CR) 10Dzahn: "@Juniorsys are you planning to amend to this?" [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:23:35] (03CR) 10Dzahn: "@Aklapper want and like this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/317990 (owner: 10Alex Monk) [23:24:47] (03CR) 10Dzahn: [C: 031] Get rid of old beta_sites class now just containing a load of ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/322604 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [23:24:49] (03CR) 10Dzahn: [C: 031] "added reviewers" [puppet] - 10https://gerrit.wikimedia.org/r/322604 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [23:25:34] (03CR) 10Chad: [C: 031] "About damn time" [puppet] - 10https://gerrit.wikimedia.org/r/322604 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [23:27:44] (03CR) 10Dzahn: [C: 032] Fix programdashboard hieradata [puppet] - 10https://gerrit.wikimedia.org/r/274572 (owner: 10Dduvall) [23:28:01] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:28:22] (03CR) 10Dzahn: [C: 032] "lgtm and would have merged but has a dependency on something related to access" [puppet] - 10https://gerrit.wikimedia.org/r/274572 (owner: 10Dduvall) [23:29:09] RainbowSprinkles: I hope you realise that's unlikely to get done now. [23:35:36] Krenair: Because people are afraid of apache? :p [23:36:19] Yes, and no one is going to push for it anymore [23:37:59] Not specifically anyway [23:38:30] There might still be people who complain about such differences in general but aren't interested in any actual progress [23:40:59] (03PS1) 10Andrew Bogott: mwopenstackclients: Retry for failed connection. [puppet] - 10https://gerrit.wikimedia.org/r/339323 [23:41:38] Krenair: That last one is a no-op in prod, we should be able to get it out [23:41:41] Puppetswat! [23:41:44] * RainbowSprinkles will push [23:41:46] Sure [23:41:48] I'm a pusher! [23:41:53] The last one was never going to be the problem. [23:42:04] Read up the dependency list [23:42:20] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients: Retry for failed connection. 
[puppet] - 10https://gerrit.wikimedia.org/r/339323 (owner: 10Andrew Bogott) [23:42:22] Oh shitbaskets [23:42:28] indeed [23:42:29] I thought those had merged and this was just the last bit [23:43:43] no [23:54:18] (03PS1) 10DatGuy: Update logo for bswiki (Bosnian Wikipedia) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) [23:55:28] what's the difference in the binary of the two? [23:56:13] DatGuy [23:56:13] https://phabricator.wikimedia.org/source/mediawiki-config/browse/HEAD/static/images/project-logos/bswiki.png;refs/changes/26/339326/1%5E1 [23:56:17] https://phabricator.wikimedia.org/source/mediawiki-config/browse/HEAD/static/images/project-logos/bswiki.png;refs/changes/26/339326/1 [23:56:34] moves the spacing of the text. [23:56:43] I know, but binary always confuses me [23:57:44] yep, you won't be able to tell the difference in the binary [23:59:23] you gotta make a gif to see that [23:59:35] it's so minimal