[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T0000). Please do the needful.
[00:00:45] (Merged) jenkins-bot: Import sources on hi.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417) (owner: Dereckson)
[00:02:02] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Import sources on hi.wiktionary (Task T132417, [[Gerrit:282843]]) (duration: 00m 26s)
[00:02:03] T132417: Add Import Sources for hi.wiktionary - https://phabricator.wikimedia.org/T132417
[00:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:04:15] I've a problem, we've an undeployed change.
[00:04:42] Oh, all is fine, it's yours MaxSem.
[00:04:45] thanks
[00:06:50] MaxSem: let me know when you're done, I have a few patches to sync
[00:07:14] MaxSem: hey wait, I'm syncing MatmaRex one
[00:07:30] !log rebooting bast4001 to PXE, no active users, no screens
[00:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:07:38] :P
[00:11:51] !log dereckson@tin Synchronized php-1.27.0-wmf.21/extensions/AbuseFilter/Views/AbuseFilterViewEdit.php: Fixes to filter profiling ([[Gerrit:283333]], 1/3) (duration: 00m 26s)
[00:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:12:25] !log dereckson@tin Synchronized php-1.27.0-wmf.21/extensions/AbuseFilter/Views/AbuseFilterViewList.php: Fixes to filter profiling ([[Gerrit:283333]], 1/3) (duration: 00m 26s)
[00:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:14] !log dereckson@tin Synchronized php-1.27.0-wmf.21/extensions/AbuseFilter/AbuseFilter.class.php: Fixes to filter profiling ([[Gerrit:283333]], 3/3) (duration: 00m 26s)
[00:13:16] MatmaRex: okay, change deployed, you can test something before the config change?
[00:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:55] Dereckson: this should've been mostly a no-op, other than maybe losing some cached data after the switch
[00:14:11] Let's go for the config one so.
[00:14:32] Operations, Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#2205604 (Dzahn) a:Dzahn
[00:14:55] (CR) Dereckson: [C: 2] "Backport done in I08316c6a3192bd69248cf5ab5a3ed8185341c313." [mediawiki-config] - https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: Bartosz Dziewoński)
[00:15:07] We'll probably have to rebase it.
[00:15:47] (PS4) Dereckson: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: Bartosz Dziewoński)
[00:15:54] (CR) Dereckson: "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: Bartosz Dziewoński)
[00:17:48] (PS1) BBlack: Switch mobile hostnames to text IP [dns] - https://gerrit.wikimedia.org/r/283364 (https://phabricator.wikimedia.org/T124482)
[00:18:41] Zuul ...
[00:19:33] Operations, Traffic, Zero, Patch-For-Review: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2205612 (BBlack) https://gerrit.wikimedia.org/r/283364 above does the functional user-facing change. If it's successful without issue...
[00:19:44] (PS1) Dzahn: admin: add empty group of swift-roots [puppet] - https://gerrit.wikimedia.org/r/283365 (https://phabricator.wikimedia.org/T130910)
[00:19:52] bblack: \o/
[00:19:54] MatmaRex: Zuul doesn't take https://gerrit.wikimedia.org/r/#/c/282806/
[00:19:58] that's awesome!
[00:20:18] doesn't take how?
[00:20:18] oh
[00:20:27] Dereckson: try removing the C+2, then adding it again
[00:20:35] (CR) Dereckson: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: Bartosz Dziewoński)
[00:20:41] bblack: nice!
[00:20:44] (CR) Dereckson: [C: 2] Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: Bartosz Dziewoński)
[00:20:56] there's some weird bug where rebasing a change with C+2 doesn't trigger gate-and-submit, even though it keeps the C+2
[00:21:25] oh
[00:21:54] we've clean history hi.wikt > your change and tests ok for your change
[00:22:04] So I guess, we could submit it directly.
[00:22:14] (Merged) jenkins-bot: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: Bartosz Dziewoński)
[00:22:25] (CR) Dzahn: [C: 2] "https://office.wikimedia.org/wiki/Operations/Operations_Meeting_Notes/TechOps-2016-04-11#Access_Requests" [puppet] - https://gerrit.wikimedia.org/r/283365 (https://phabricator.wikimedia.org/T130910) (owner: Dzahn)
[00:22:30] Ah no need, perfect.
[00:23:07] MatmaRex: Only +2 triggers gate.. rebase is just rebase and score gets carried along so that -1 and +1 is preserved. Automatically going into the gate is probably unexpected in most cases.
[00:23:31] Krinkle: makes sense
[00:23:32] and consistently doesn't happen. It's not a bug at the moment, it's intentional. But that can be debated.
[00:23:36] !log dereckson@tin Synchronized wmf-config/abusefilter.php: Enable $wgAbuseFilterProfile for commonswiki ([[Gerrit:282806]], Task T132200) (duration: 00m 28s)
[00:23:37] T132200: Enable $wgAbuseFilterProfile on Commons for a few days - https://phabricator.wikimedia.org/T132200
[00:23:42] MatmaRex: you can test
[00:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:23:52] Krinkle: i dunno, if a change has C+2, i'd expect jenkins to do its damnedest to get it merged as soon as possible. but no matter. :)
[00:24:05] for code repo, perhaps
[00:24:15] for wmf branches, where deployment is expected directly :/
[00:24:17] MatmaRex: It can cause code to merge unexpectedly and by unauthorised people.
[00:24:45] Dereckson: thanks, all seems okay
[00:24:47] E.g. I +2 and it fails for some reason (Jenkins outage, dependency missing, whatever) then a week goes by and a random person does a rebase. that should not make it land.
[00:25:01] Especially on a tuesday morning where master merges go to prod in under an hour.
[00:25:30] especially not with our current code review and CI practices.
[00:25:46] fair enough.
[00:25:48] But yeah, if we're all a lot stricter and with better processes, maybe one day that will be nice.
[00:26:15] MaxSem: okay you can deploy your TextExtracts change
[00:26:22] Thanks for testing MatmaRex.
[00:27:09] Dereckson: hmm, the filter stats might've gotten skewed in the few minutes when we were counting filter hits with the new cache thingy, and not saving the profiling data. but that's not a problem.
i need to write a maintenance script to clear these :D
[00:27:28] !log maxsem@tin Synchronized php-1.27.0-wmf.21/extensions/TextExtracts/: https://gerrit.wikimedia.org/r/#/c/282640/ (duration: 00m 26s)
[00:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:27:51] (the couple hundred actions it's off by will be a drop in the bucket in a few hours)
[00:28:24] (PS1) Dzahn: admin/swift: add swift-roots in hiera/common [puppet] - https://gerrit.wikimedia.org/r/283366 (https://phabricator.wikimedia.org/T130910)
[00:28:36] !log maxsem@tin Synchronized php-1.27.0-wmf.20/extensions/TextExtracts/: https://gerrit.wikimedia.org/r/#/c/282640/ (duration: 00m 26s)
[00:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:29:01] (PS3) Ori.livneh: Force $_SERVER['SERVER_SOFTWARE'] to be "Apache" [mediawiki-config] - https://gerrit.wikimedia.org/r/283341 (https://phabricator.wikimedia.org/T132612)
[00:29:06] (CR) Ori.livneh: [C: 2] Force $_SERVER['SERVER_SOFTWARE'] to be "Apache" [mediawiki-config] - https://gerrit.wikimedia.org/r/283341 (https://phabricator.wikimedia.org/T132612) (owner: Ori.livneh)
[00:30:13] (Merged) jenkins-bot: Force $_SERVER['SERVER_SOFTWARE'] to be "Apache" [mediawiki-config] - https://gerrit.wikimedia.org/r/283341 (https://phabricator.wikimedia.org/T132612) (owner: Ori.livneh)
[00:30:46] (CR) Dzahn: [C: 2] admin/swift: add swift-roots in hiera/common [puppet] - https://gerrit.wikimedia.org/r/283366 (https://phabricator.wikimedia.org/T130910) (owner: Dzahn)
[00:32:37] hmm, can't ssh to bast4001
[00:32:46] mutante: which host do you use as bation?
[00:33:13] ori: i'm trying to upgrade that to jessie right now
[00:33:13] *bastion
[00:33:17] but TFTP across DC :p
[00:33:18] ah, ok
[00:33:21] and it's like super slow
[00:33:27] is bast1001 back?
[00:33:30] yes it is
[00:33:32] and jessie
[00:33:34] cool, i'll use that
[00:33:36] thanks
[00:33:39] :) yw
[00:34:34] ori: https://phabricator.wikimedia.org/T123721#2204676
[00:34:47] ah
[00:34:49] so you can trust the host key
[00:38:09] !log ori@tin Synchronized wmf-config/CommonSettings.php: I08cfeca7c: Force ['SERVER_SOFTWARE'] to be"Apache" (duration: 00m 26s)
[00:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:38:35] mutante: faster than eqiad-esams?
[00:38:39] I remember you aborted that one
[00:38:49] (CR) Ori.livneh: [C: 1] Set descriptionCacheExpiry for Commons repo [mediawiki-config] - https://gerrit.wikimedia.org/r/283239 (owner: Aaron Schulz)
[00:39:38] Krinkle: not really :/
[00:39:48] but in esams at least i had other servers to use
[00:39:56] in ulsfo i wouldnt know what ..hrmmm
[00:40:17] and others say they did this before and it took like 15 minutes.. hrmmm
[00:42:46] !log ori@tin Synchronized php-1.27.0-wmf.21/includes/libs/IEUrlExtension.php: Ie9799f5ea: Revert Hack IEUrlExtension::haveUndecodedRequestUri() to always return true (duration: 00m 32s)
[00:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:43:07] it doesnt look good, nothing happens
[00:43:17] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/libs/IEUrlExtension.php: Ie9799f5ea: Revert Hack IEUrlExtension::haveUndecodedRequestUri() to always return true (duration: 00m 30s)
[00:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:45:26] (PS1) Dzahn: admin: add gilles to swift-roots [puppet] - https://gerrit.wikimedia.org/r/283369 (https://phabricator.wikimedia.org/T130910)
[00:45:40] PROBLEM - RAID on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:45:51] PROBLEM - Labs LDAP on serpens is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:46:20] PROBLEM - salt-minion processes on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:30] PROBLEM - Check size of conntrack table on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:30] PROBLEM - dhclient process on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:31] PROBLEM - configured eth on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:51] PROBLEM - DPKG on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:52] PROBLEM - Disk space on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:47:11] PROBLEM - puppet last run on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:48:34] Operations, Analytics, DNS, Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2205680 (MZMcBride) >>! In T132407#2198891, @Milimetric wrote: > There's some debate about this. We haven't used data.wikimedia.org in the past because of the possible confusion with w...
[00:48:49] (CR) Dzahn: [C: 2] "gilles is already a shell user (deployment), it has been brought up in meeting got approval, waiting period has passed etc.." [puppet] - https://gerrit.wikimedia.org/r/283369 (https://phabricator.wikimedia.org/T130910) (owner: Dzahn)
[00:50:05] Operations, Ops-Access-Requests, Patch-For-Review: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2205684 (Dzahn) a:Dzahn
[00:51:20] hmmm, no user being created yet on a swift box
[00:51:34] hieradata common/swift.yaml i expected to just do it
[00:53:11] and the install also wont work..
meh meh
[00:55:04] Operations, Ops-Access-Requests, Patch-For-Review: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2205692 (Dzahn) Hi all, so i created a new group swift-roots, put that group in hiera common/swift.yaml, then put gilles into that group, and expected this to be re...
[01:06:33] !log bast4001 back up unchanged - tftp wouldnt work across DC, probably network ACLs
[01:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:14:58] (CR) Aaron Schulz: [C: -1] "Blocked on https://gerrit.wikimedia.org/r/#/c/283238/" [mediawiki-config] - https://gerrit.wikimedia.org/r/283239 (owner: Aaron Schulz)
[01:15:32] (CR) Dzahn: [C: -2] "reinstall not done yet" [puppet] - https://gerrit.wikimedia.org/r/283361 (https://phabricator.wikimedia.org/T123674) (owner: Dzahn)
[01:16:23] AaronSchulz: doh, I reviewed that and thought I merged it.
[01:23:02] PROBLEM - NTP on serpens is CRITICAL: NTP CRITICAL: No response from NTP server
[01:26:11] RECOVERY - salt-minion processes on serpens is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:26:17] !log rebooting unresponsive serpens.wm.org
[01:26:21] RECOVERY - dhclient process on serpens is OK: PROCS OK: 0 processes with command name dhclient
[01:26:21] RECOVERY - Check size of conntrack table on serpens is OK: OK: nf_conntrack is 0 % full
[01:26:21] RECOVERY - configured eth on serpens is OK: OK - interfaces up
[01:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:26:50] RECOVERY - DPKG on serpens is OK: All packages OK
[01:26:52] RECOVERY - Disk space on serpens is OK: DISK OK
[01:27:11] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[01:27:22] RECOVERY - RAID on serpens is OK: OK: no RAID installed
[01:27:32] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.144 seconds response
time
[01:29:52] !log mw1211 - restart hhvm
[01:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:30:31] RECOVERY - HHVM rendering on mw1211 is OK: HTTP OK: HTTP/1.1 200 OK - 66712 bytes in 0.190 second response time
[01:31:40] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.048 second response time
[01:34:05] (PS4) Dereckson: Add maps-cluster referer rule for pl.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510)
[01:36:02] (PS1) Dzahn: swift/admin: put admin groups in role/common/ files [puppet] - https://gerrit.wikimedia.org/r/283371 (https://phabricator.wikimedia.org/T130910)
[01:37:56] (PS2) Dzahn: swift/admin: put admin groups in role/common/ files [puppet] - https://gerrit.wikimedia.org/r/283371 (https://phabricator.wikimedia.org/T130910)
[01:38:32] (PS3) Dzahn: swift/admin: put admin group in role/common/swift.yaml [puppet] - https://gerrit.wikimedia.org/r/283371 (https://phabricator.wikimedia.org/T130910)
[01:45:11] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: puppet fail
[01:45:51] RECOVERY - NTP on serpens is OK: NTP OK: Offset -0.001129031181 secs
[01:55:21] (CR) Dzahn: [C: 2] swift/admin: put admin group in role/common/swift.yaml [puppet] - https://gerrit.wikimedia.org/r/283371 (https://phabricator.wikimedia.org/T130910) (owner: Dzahn)
[02:02:57] (PS1) Dzahn: swift/admin: set admin group per dc and role [puppet] - https://gerrit.wikimedia.org/r/283372 (https://phabricator.wikimedia.org/T130910)
[02:06:31] (CR) Dzahn: [C: 2] swift/admin: set admin group per dc and role [puppet] - https://gerrit.wikimedia.org/r/283372 (https://phabricator.wikimedia.org/T130910) (owner: Dzahn)
[02:11:51] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[02:13:06] (PS1) Dzahn: swift: include
admin in role classes [puppet] - https://gerrit.wikimedia.org/r/283373
[02:22:05] (CR) Legoktm: "Did this really need 37 separate bugs?" [puppet] - https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: KartikMistry)
[02:25:42] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 11m 13s)
[02:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:40] !log backfilled missing SAL entries from 2016-04-13T10:56Z to 2016-04-13T20:20Z to https://tools.wmflabs.org/sal/production
[02:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:52:39] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 12m 26s)
[02:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:53:59] (CR) Ori.livneh: Convert mwgrep to use regexp by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/283107 (owner: EBernhardson)
[03:02:03] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Apr 14 03:02:03 UTC 2016 (duration 9m 25s)
[03:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:43:05] !log mwgrep deleteEqualMessages.php --wiki kawikiquote
[03:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:38:31] (PS6) Madhuvishy: [WIP] ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - https://gerrit.wikimedia.org/r/277189
[05:39:01] Operations, Ops-Access-Requests, Patch-For-Review: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2205972 (Dzahn) @fgiunchedi @Andrew i thought this would be just resolved by now. wanna take a look what technical part i'm missing? I made a new group, i tried to p...
[05:40:49] (CR) Yuvipanda: "Minor fixes."
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/277189 (owner: Madhuvishy)
[05:44:55] (PS7) Madhuvishy: ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - https://gerrit.wikimedia.org/r/277189
[05:46:23] (CR) jenkins-bot: [V: -1] ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - https://gerrit.wikimedia.org/r/277189 (owner: Madhuvishy)
[05:54:00] (PS8) Madhuvishy: ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - https://gerrit.wikimedia.org/r/277189
[05:55:04] (CR) jenkins-bot: [V: -1] ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - https://gerrit.wikimedia.org/r/277189 (owner: Madhuvishy)
[05:55:12] gah jenkins why
[05:56:45] (PS9) Yuvipanda: ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - https://gerrit.wikimedia.org/r/277189 (owner: Madhuvishy)
[06:00:00] (CR) Yuvipanda: [C: 2] ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - https://gerrit.wikimedia.org/r/277189 (owner: Madhuvishy)
[06:04:11] Operations, ops-eqiad, DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2205978 (Volans) @Cmjohnson did you had a chance to check the rack for a possible hot spot in the DC? In case you think that the problem is the thermal paste, probably the safer opt...
[06:09:16] (PS6) Elukey: Add complete submodule support to the puppet compiler.
[software/puppet-compiler] - https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154)
[06:20:43] Operations, DBA, hardware-requests: Decomission db1010 - https://phabricator.wikimedia.org/T129395#2205996 (Volans) I chat with @RobH the same day on IRC, just forgot to update the task here: we decided to wait for @jcrespo
[06:27:17] Operations, ops-codfw, DBA, hardware-requests: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T129452#2206000 (Peachey88)
[06:30:10] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:51] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: puppet fail
[06:31:10] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:13] Operations, DBA, hardware-requests: Decommission db1010 - https://phabricator.wikimedia.org/T129395#2206007 (MZMcBride)
[06:31:30] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:51] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:10] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:20] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:12] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:01] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:43:22] (PS1) Madhuvishy: ifttt: Pass callable option to uwsgi [puppet] - https://gerrit.wikimedia.org/r/283382
[06:44:44] (CR) Yuvipanda: [C: 2] ifttt: Pass callable option to uwsgi [puppet] - https://gerrit.wikimedia.org/r/283382 (owner: Madhuvishy)
[06:56:41] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:56:51] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 36 seconds
ago with 0 failures
[06:57:21] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:40] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:50] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] (PS4) Muehlenhoff: Don't use package-> latest for apt-transport-https [puppet] - https://gerrit.wikimedia.org/r/282941
[06:58:30] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:41] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:41] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:03:49] (CR) Muehlenhoff: [C: 2 V: 2] Don't use package-> latest for apt-transport-https [puppet] - https://gerrit.wikimedia.org/r/282941 (owner: Muehlenhoff)
[07:08:48] (PS7) Elukey: Add complete submodule support to the puppet compiler.
[software/puppet-compiler] - https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154)
[07:24:23] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2206045 (elukey)
[07:24:25] Operations, Analytics-Kanban, Traffic, Patch-For-Review: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2206044 (elukey) Open>Resolved
[07:28:11] Operations: ytterbium and strontium daily cronspam - https://phabricator.wikimedia.org/T132661#2206046 (elukey)
[07:28:31] Operations: ytterbium and strontium daily cronspam - https://phabricator.wikimedia.org/T132661#2206046 (elukey) p:Normal>Low
[07:30:13] Operations: ytterbium, neon and strontium daily cronspam - https://phabricator.wikimedia.org/T132661#2206060 (elukey)
[07:40:00] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 200, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/3/3: down - Peering: Equinix Ashburn Exchange {#2648} [10Gbps]
[07:42:47] ---^ should be related to the maintenance window
[07:42:58] (Equinix)
[07:45:36] mmmm torrus looks a bit weird https://torrus.wikimedia.org/torrus/Network?path=/Core_routers/cr2-eqiad.wikimedia.org/Interface_Counters/&view=overview-subleaves-html&OVS=traffic
[07:45:40] (no metrics)
[07:46:47] * elukey looks at librenms
[07:51:20] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 202, down: 0, dormant: 0, excluded: 0, unused: 0
[08:13:13] (CR) Giuseppe Lavagetto: [C: -1] "Generally ok, but some code polish is needed and the handling of errors/failures must be redone completely."
(7 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154) (owner: Elukey)
[08:21:41] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:28:20] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:32:18] Operations, MediaWiki-Parser, Parsing-Team, Traffic, and 3 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2206219 (Jdlrobson) So here are some facts * On pages impacted by the bug, when queried ParserOutput->getTOCEnabled returned f...
[08:32:25] ^ labmon is fine, ongoing install, but the slow I/O makes this show up
[08:38:01] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[08:39:22] PROBLEM - salt-minion processes on labsdb1003 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:41:59] Operations, ops-codfw, Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2206223 (MoritzMuehlenhoff) a:Papaul
[08:42:51] Operations, ops-codfw, Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2193908 (MoritzMuehlenhoff) That box didn't come back up after a kernel reboot and from what I remember the serial console was dead. Papaul, can you please check the...
[08:48:50] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[09:00:23] !log fixed stray salt minion processes on labsdb1003 (apparently caused by stale pidfile)
[09:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:01:00] RECOVERY - salt-minion processes on labsdb1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:07:58] (CR) Elukey: Add complete submodule support to the puppet compiler. (5 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154) (owner: Elukey)
[09:09:12] (PS8) Elukey: Add complete submodule support to the puppet compiler. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154)
[09:14:19] (PS1) Jalexander: Change voteWiki to fa language temporarily [mediawiki-config] - https://gerrit.wikimedia.org/r/283400 (https://phabricator.wikimedia.org/T132667)
[09:32:07] (PS1) Muehlenhoff: Add CVE IDs which was assigned after the respective stable patches were merged [debs/linux] - https://gerrit.wikimedia.org/r/283402
[09:32:49] (CR) Muehlenhoff: [C: 2 V: 2] Add CVE IDs which was assigned after the respective stable patches were merged [debs/linux] - https://gerrit.wikimedia.org/r/283402 (owner: Muehlenhoff)
[09:40:31] (PS1) Giuseppe Lavagetto: mediawiki: Use forward_syslog for codfw too [puppet] - https://gerrit.wikimedia.org/r/283404
[09:40:33] (PS1) Giuseppe Lavagetto: confd: removing redundant hiera data [puppet] - https://gerrit.wikimedia.org/r/283405
[09:46:40] (PS1) Muehlenhoff: Use yubiauth::server role for auth2001 [puppet] - https://gerrit.wikimedia.org/r/283406
[09:47:22] (CR) Giuseppe Lavagetto: [C: 2] "LGTM:" [puppet] - https://gerrit.wikimedia.org/r/283404 (owner: Giuseppe Lavagetto)
[09:53:03] (CR) Alexandros
Kosiaris: [C: 2] Add maps-cluster referer rule for pl.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: Dereckson)
[09:53:31] (PS5) Alexandros Kosiaris: Add maps-cluster referer rule for pl.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: Dereckson)
[09:53:36] (CR) Alexandros Kosiaris: [V: 2] Add maps-cluster referer rule for pl.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: Dereckson)
[10:06:37] (PS2) Muehlenhoff: Use yubiauth::server role for auth2001 [puppet] - https://gerrit.wikimedia.org/r/283406
[10:08:20] (CR) Giuseppe Lavagetto: [C: 1] Disable connection tracking for redis/jobqueue [puppet] - https://gerrit.wikimedia.org/r/283167 (owner: Muehlenhoff)
[10:17:14] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:19:04] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 66766 bytes in 0.254 second response time
[10:26:40] (PS1) Elukey: Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322)
[10:27:26] Operations, ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2206320 (mark)
[10:27:28] Operations, hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2206322 (mark)
[10:30:10] (PS2) Elukey: Override kafkatee's default logrotate/rsyslog configuration.
[puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322)
[10:35:17] (PS2) Giuseppe Lavagetto: confd: removing redundant hiera data [puppet] - https://gerrit.wikimedia.org/r/283405
[10:37:34] (CR) Giuseppe Lavagetto: [C: 2] confd: removing redundant hiera data [puppet] - https://gerrit.wikimedia.org/r/283405 (owner: Giuseppe Lavagetto)
[10:37:45] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2206343 (Gehel)
[10:52:35] PROBLEM - confd service on cp1008 is CRITICAL: CRITICAL - Expecting active but unit confd is activating
[10:56:28] jouncebot_: next
[10:56:28] In 4 hour(s) and 3 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T1500)
[10:59:08] (PS1) BBlack: Kill 5xx thresh for non-existent cache_parsoid [puppet] - https://gerrit.wikimedia.org/r/283412
[11:09:46] (PS2) Muehlenhoff: Disable connection tracking for redis/jobqueue [puppet] - https://gerrit.wikimedia.org/r/283167
[11:09:53] (CR) Muehlenhoff: [C: 2 V: 2] Disable connection tracking for redis/jobqueue [puppet] - https://gerrit.wikimedia.org/r/283167 (owner: Muehlenhoff)
[11:10:53] Operations, Datasets-General-or-Unknown, Traffic, HTTPS, Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2206392 (mforns) @Ottomata We might want to change this to https then: https://github.com/wikimedia/analytics-dashiki/blob/master/s...
[11:19:21] (CR) Giuseppe Lavagetto: [C: 1] Kill 5xx thresh for non-existent cache_parsoid [puppet] - https://gerrit.wikimedia.org/r/283412 (owner: BBlack)
[11:20:32] (PS1) Dereckson: Add bio.acousti.ca to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/283414 (https://phabricator.wikimedia.org/T132140)
[11:21:14] (PS3) Elukey: Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322)
[11:25:09] (PS3) Giuseppe Lavagetto: redis::monitoring::instance: vary description based on instance name [puppet] - https://gerrit.wikimedia.org/r/282949
[11:25:43] (CR) Giuseppe Lavagetto: [C: 2] redis::monitoring::instance: vary description based on instance name [puppet] - https://gerrit.wikimedia.org/r/282949 (owner: Giuseppe Lavagetto)
[11:26:15] (CR) Giuseppe Lavagetto: [V: 2] redis::monitoring::instance: vary description based on instance name [puppet] - https://gerrit.wikimedia.org/r/282949 (owner: Giuseppe Lavagetto)
[11:26:31] Operations, Analytics, Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2206419 (elukey) a:elukey>None
[11:28:30] (PS1) Muehlenhoff: Enable base::firewall on rdb1001 [puppet] - https://gerrit.wikimedia.org/r/283415
[11:29:12] (PS2) BBlack: Kill 5xx thresh for non-existent cache_parsoid [puppet] - https://gerrit.wikimedia.org/r/283412
[11:29:19] (CR) BBlack: [C: 2 V: 2] Kill 5xx thresh for non-existent cache_parsoid [puppet] - https://gerrit.wikimedia.org/r/283412 (owner: BBlack)
[11:29:59] (Abandoned) Giuseppe Lavagetto: dsh: create files based on exported resources [puppet] - https://gerrit.wikimedia.org/r/179121 (owner: Giuseppe Lavagetto)
[11:33:22] Operations, ops-esams, DC-Ops, Traffic: cp30[34]x hw/firmware/BMC issues -
https://phabricator.wikimedia.org/T126062#2206420 (10BBlack) FTR: I think I've done the blacklist hack on a couple more since, but not recorded them here. There was some suggestion elsewhere that we may need an iDRAC firmwa... [11:40:18] (03PS1) 10Filippo Giunchedi: depool upload/eqiad for codfw switchover [dns] - 10https://gerrit.wikimedia.org/r/283416 [11:41:34] (03PS2) 10Muehlenhoff: Enable base::firewall on rdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/283415 [11:41:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on rdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/283415 (owner: 10Muehlenhoff) [11:42:28] (03CR) 10Giuseppe Lavagetto: [C: 031] Add complete submodule support to the puppet compiler. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154) (owner: 10Elukey) [11:42:42] (03CR) 10Giuseppe Lavagetto: [C: 032] Add complete submodule support to the puppet compiler. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154) (owner: 10Elukey) [11:45:21] (03PS2) 10Giuseppe Lavagetto: role::jobqueue_redis: add monitoring of the redis instances [puppet] - 10https://gerrit.wikimedia.org/r/282950 [11:48:00] (03PS1) 10Elukey: Bumping version to 0.13 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/283417 (https://phabricator.wikimedia.org/T132154) [11:48:39] (03PS2) 10Elukey: Bumping version to 0.1.3 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/283417 (https://phabricator.wikimedia.org/T132154) [11:48:49] (03CR) 10Elukey: [C: 032] Bumping version to 0.1.3 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/283417 (https://phabricator.wikimedia.org/T132154) (owner: 10Elukey) [11:48:58] (03CR) 10Elukey: [V: 032] Bumping version to 0.1.3 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/283417 (https://phabricator.wikimedia.org/T132154) (owner: 10Elukey) [11:49:46] 06Operations, 
10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2206447 (10BBlack) Should be fixed now. I did: ``` root@neodymium:~# salt -v -t 10 -L `cat exmobile` cmd.run 'rm -f /etc/logrotate.d/varnishkafka-*' ``... [11:50:10] (03PS3) 10Filippo Giunchedi: varnish: switch upload codfw from 'eqiad' to 'direct' [puppet] - 10https://gerrit.wikimedia.org/r/282891 (https://phabricator.wikimedia.org/T129089) [11:50:12] (03PS2) 10Filippo Giunchedi: varnish: switch text 'rendering' app from 'eqiad' to 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/282893 (https://phabricator.wikimedia.org/T129089) [11:50:14] (03PS2) 10Filippo Giunchedi: varnish: switch upload eqiad from 'direct' to 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/282892 (https://phabricator.wikimedia.org/T129089) [11:50:16] (03PS1) 10Filippo Giunchedi: varnish: switch esams from 'eqiad' to 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/283418 (https://phabricator.wikimedia.org/T129089) [11:50:22] (03PS3) 10Giuseppe Lavagetto: role::jobqueue_redis: add monitoring of the redis instances [puppet] - 10https://gerrit.wikimedia.org/r/282950 [11:52:07] (03PS1) 10ArielGlenn: move mwbzutils and production dump scripts to subdirs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/283419 [11:53:23] !log deployed new puppet-compiler version - 0.1.3 (adding submodules support) [11:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:53:50] (03CR) 10ArielGlenn: [C: 032] move mwbzutils and production dump scripts to subdirs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/283419 (owner: 10ArielGlenn) [11:55:37] moritzm: thanks for fixing salt on labsdb1003 [11:57:21] 06Operations: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2206462 (10ema) Two workarounds seem to be possible. 
Creating `/etc/initramfs-tools/scripts/local-top/mdadm-enhance-your-calm` with a 2 seconds sleep and running `update-initramfs -u -k... [11:58:17] (03PS1) 10ArielGlenn: update location of upstream repo for git, browser [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/283420 [11:59:06] (03Abandoned) 10Filippo Giunchedi: varnish: switch text 'rendering' app from 'eqiad' to 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/282893 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [12:00:59] (03PS2) 10BBlack: switchover: switch api/appservers/rendering varnish routing from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/282910 [12:04:38] (03PS1) 10Muehlenhoff: Enable base::firewall on remaining rdb1* systems [puppet] - 10https://gerrit.wikimedia.org/r/283421 [12:07:20] (03PS3) 10Muehlenhoff: Use yubiauth::server role for auth2001 [puppet] - 10https://gerrit.wikimedia.org/r/283406 [12:07:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use yubiauth::server role for auth2001 [puppet] - 10https://gerrit.wikimedia.org/r/283406 (owner: 10Muehlenhoff) [12:08:41] RECOVERY - confd service on cp1008 is OK: OK - confd is active [12:09:06] (03CR) 10ArielGlenn: [C: 032 V: 032] update location of upstream repo for git, browser [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/283420 (owner: 10ArielGlenn) [12:14:31] PROBLEM - confd service on cp1008 is CRITICAL: CRITICAL - Expecting active but unit confd is activating [12:23:18] (03PS1) 10Dereckson: Babel configuration for uz.wikipedia (part 1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283425 (https://phabricator.wikimedia.org/T131924) [12:25:53] (03CR) 10Dereckson: [C: 031] Change voteWiki to fa language temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283400 (https://phabricator.wikimedia.org/T132667) (owner: 10Jalexander) [12:36:44] (03PS1) 10BBlack: codfw switch: codfw caches -> direct [puppet] - 10https://gerrit.wikimedia.org/r/283430 [12:36:46] (03PS1) 10BBlack: 
codfw switch: esams caches -> codfw [puppet] - 10https://gerrit.wikimedia.org/r/283431 [12:36:48] (03PS1) 10BBlack: codfw switch: eqiad caches -> codfw [puppet] - 10https://gerrit.wikimedia.org/r/283432 [12:37:47] (03PS1) 10BBlack: codfw switch: geodns depool of eqiad for users [dns] - 10https://gerrit.wikimedia.org/r/283433 [12:40:01] 07Blocked-on-Operations, 06Operations, 10hardware-requests: Evaluate replacing SATA disks on ganeti100X.eqiad.wmnet with SSDs - https://phabricator.wikimedia.org/T132679#2206587 (10akosiaris) [12:41:04] !log Ran initSiteStats.php for 10 wikis to fix negative/off-by-one statistics errors (T131306) [12:41:06] T131306: Reset file count statistics on a few wikis with negative/off-by-one errors - https://phabricator.wikimedia.org/T131306 [12:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:45:41] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2204871 (10Anomie) PATH_INFO is fine and expected. The supposedly-bad PHP_SELF is probably also ok (at lea... [12:51:43] (03PS2) 10Muehlenhoff: Enable base::firewall on remaining rdb1* systems [puppet] - 10https://gerrit.wikimedia.org/r/283421 [12:52:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on remaining rdb1* systems [puppet] - 10https://gerrit.wikimedia.org/r/283421 (owner: 10Muehlenhoff) [12:53:21] (03PS1) 10Dereckson: New logo for vec.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283435 (https://phabricator.wikimedia.org/T132157) [12:55:20] (03CR) 10Dereckson: [C: 04-1] "SVG doesn't offer a scalable vectorial content." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/283435 (https://phabricator.wikimedia.org/T132157) (owner: 10Dereckson) [12:56:28] (03PS2) 10Dereckson: New logo for vec.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283435 (https://phabricator.wikimedia.org/T132157) [12:56:34] (03PS1) 10Muehlenhoff: Enable base::firewall on remaining rdb1* systems [puppet] - 10https://gerrit.wikimedia.org/r/283436 [12:56:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on remaining rdb1* systems [puppet] - 10https://gerrit.wikimedia.org/r/283436 (owner: 10Muehlenhoff) [12:58:29] (03CR) 10Dereckson: "PS2: removed HD content" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283435 (https://phabricator.wikimedia.org/T132157) (owner: 10Dereckson) [13:07:36] 06Operations, 10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2206663 (10elukey) 05Open>03Resolved [13:07:39] 06Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2206664 (10elukey) [13:07:55] 06Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204366 (10Anomie) Whether or not QUERY_STRING is encoded incorrectly, that doesn't look like the cause of your problem... 
[13:21:24] RECOVERY - confd service on cp1008 is OK: OK - confd is active [13:25:07] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2206706 (10BBlack) [13:25:10] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2206705 (10BBlack) [13:27:05] PROBLEM - confd service on cp1008 is CRITICAL: CRITICAL - Expecting active but unit confd is activating [13:27:25] 06Operations, 10Traffic, 07HTTPS: enable https for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T132450#2206709 (10BBlack) [13:27:27] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2206708 (10BBlack) [13:28:19] 06Operations, 10Traffic, 07HTTPS: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2198925 (10BBlack) [13:28:38] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/282968 (https://phabricator.wikimedia.org/T126262) (owner: 10Andrew Bogott) [13:30:22] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2206747 (10BBlack) [13:30:39] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2206750 (10BBlack) [13:30:42] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#366296 (10BBlack) [13:36:02] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [13:38:12] 06Operations, 10ops-eqiad, 06DC-Ops: eqiad: Rack/Setup 6 new pool servers - https://phabricator.wikimedia.org/T132684#2206767 (10Cmjohnson) [13:38:52] (03PS1) 10Dereckson: Add museumcommons.wikimedia.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283439 (https://phabricator.wikimedia.org/T131841) [13:43:29] (03PS1) 10Muehlenhoff: Enable base::firewall on tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283441 [13:44:14] 06Operations, 10Traffic, 07HTTPS: Preload STS for wikimedia.org - https://phabricator.wikimedia.org/T132685#2206791 (10BBlack) [13:44:32] 06Operations, 10Traffic, 07HTTPS: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2206806 (10BBlack) [13:44:35] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2206807 (10BBlack) [13:44:37] 06Operations, 10Traffic, 07HTTPS: Preload STS for wikimedia.org - https://phabricator.wikimedia.org/T132685#2206791 (10BBlack) [13:46:33] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [13:48:41] 06Operations, 10Traffic, 06Zero, 13Patch-For-Review: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - 
https://phabricator.wikimedia.org/T124482#2206829 (10BBlack) Holding on merging the above until after the codfw-switchover week, so as not to create too many overlapping effects... [13:49:00] (03CR) 10BBlack: [C: 04-1] "On hold for now, will look at merging after codfw-switchover week" [dns] - 10https://gerrit.wikimedia.org/r/283364 (https://phabricator.wikimedia.org/T124482) (owner: 10BBlack) [13:50:54] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [13:51:32] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [13:51:54] 06Operations, 10Traffic, 07Performance: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#2206839 (10BBlack) On the Zero issues: the latest update from the Zero team is they still have exactly one carrier that cares about the multimed... [13:56:48] 06Operations, 10Traffic, 07Performance: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#2206854 (10BBlack) Also should note (I thought it was mentioned earlier, but apparently not): we're not even sure how we'd structure this at the... 
[14:03:43] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (10Krenair) [14:08:03] 06Operations, 10Traffic, 07Performance: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#2206896 (10BBlack) 05Open>03stalled [14:13:16] (03PS17) 10Ottomata: Kafka config: Add config functions [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [14:15:46] (03Abandoned) 10Dereckson: Add museumcommons.wikimedia.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283439 (https://phabricator.wikimedia.org/T131841) (owner: 10Dereckson) [14:17:21] (03PS2) 10Dereckson: Adding museumcommons.wikimedia.nl on $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281700 (https://phabricator.wikimedia.org/T131841) (owner: 10MarcoAurelio) [14:19:25] (03CR) 10Dereckson: "PS2: rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281700 (https://phabricator.wikimedia.org/T131841) (owner: 10MarcoAurelio) [14:19:35] (03CR) 10Dereckson: [C: 031] Adding museumcommons.wikimedia.nl on $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281700 (https://phabricator.wikimedia.org/T131841) (owner: 10MarcoAurelio) [14:20:03] !log cr1-esams: enabling IPv4/IPv6 BGP sessions with TeliaSonera [14:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:25] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2206943 (10Milimetric) done https://gerrit.wikimedia.org/r/#/c/283446/ [14:24:00] !log rebooting cp3022 to test workaroud for T131961 [14:24:01] T131961: Boot 
time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961 [14:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:02] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:02] (03PS1) 10BBlack: Remove decommed esams caches from disk list [puppet] - 10https://gerrit.wikimedia.org/r/283447 [14:25:48] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2206951 (10Milimetric) the pageview API is running on wikimedia.org, that's the prod cluster. This task right now is about having a production domain for reports like this: https://brows... [14:26:13] RECOVERY - Host cp3022 is UP: PING OK - Packet loss = 0%, RTA = 83.74 ms [14:26:52] (03CR) 10BBlack: [C: 032] Remove decommed esams caches from disk list [puppet] - 10https://gerrit.wikimedia.org/r/283447 (owner: 10BBlack) [14:27:17] (03PS1) 10Dereckson: Enable GuidedTour extension on sq.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283450 (https://phabricator.wikimedia.org/T132412) [14:28:40] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2206958 (10Milimetric) > Potentially tangentially: I'm unclear what the distinction between Labs and production is when the Wikimedia Foundation is running/operating both. Are there other... 
[14:29:00] 06Operations, 07Graphite: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#2206963 (10fgiunchedi) a:05fgiunchedi>03None auto assigned by mistake while cloning [14:29:49] 07Puppet, 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service, 13Patch-For-Review, 03Scap3: Puppet on deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2206969 (10Krenair) [14:30:29] 06Operations, 07Graphite, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2206972 (10fgiunchedi) [14:30:31] 06Operations, 10MediaWiki-General-or-Unknown, 07Graphite: mediawiki statsd traffic - https://phabricator.wikimedia.org/T132472#2206970 (10fgiunchedi) 05Open>03Resolved fixed by https://gerrit.wikimedia.org/r/#/c/283001/ and https://gerrit.wikimedia.org/r/#/c/282990/ [14:30:53] 07Puppet, 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service, 03Scap3: Puppet on deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2192991 (10Krenair) Yep, now AQS [14:32:48] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#2206982 (10fgiunchedi) [14:32:50] 06Operations, 06Services, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase100x - https://phabricator.wikimedia.org/T113714#2206981 (10fgiunchedi) [14:33:11] (03PS2) 10Elukey: This is a test for the puppet compiler, not meant to be committed. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/282670 [14:33:33] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2206983 (10BBlack) >>! In T132407#2205680, @MZMcBride wrote: > Potentially tangentially: I'm unclear what the distinction between Labs and production is when the Wikimedia Foundation is r... 
[14:34:47] (03PS4) 10Andrew Bogott: Include php5-readline package on deployment servers. [puppet] - 10https://gerrit.wikimedia.org/r/282968 (https://phabricator.wikimedia.org/T126262) [14:35:08] (03PS1) 10Muehlenhoff: Add gnome-pkg-tools to package_pbuider file list [puppet] - 10https://gerrit.wikimedia.org/r/283451 [14:36:05] (03CR) 10Andrew Bogott: [C: 032] Include php5-readline package on deployment servers. [puppet] - 10https://gerrit.wikimedia.org/r/282968 (https://phabricator.wikimedia.org/T126262) (owner: 10Andrew Bogott) [14:37:03] (03PS18) 10Mobrovac: Kafka config: Add config functions [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) [14:37:23] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:33] RECOVERY - Host cp3022 is UP: PING OK - Packet loss = 0%, RTA = 83.65 ms [14:40:39] 06Operations, 13Patch-For-Review: Install php5-readline on trusty and jessie hosts so eval.php, sql.php, and so on are more useful - https://phabricator.wikimedia.org/T126262#2207005 (10Andrew) 05Open>03Resolved a:03Andrew [14:41:11] (03CR) 10Mobrovac: "mucho bueno - https://puppet-compiler.wmflabs.org/2448/" [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [14:43:05] (03PS2) 10Alex Monk: shinken: Remove old beta bits check [puppet] - 10https://gerrit.wikimedia.org/r/282498 [14:46:54] 06Operations, 10ops-esams: Replace cr2-knams MX80 MIC slot with a 2x10G MIC - https://phabricator.wikimedia.org/T111765#2207031 (10faidon) [14:54:02] 06Operations, 10netops: turn-up/implement zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#2207068 (10mark) [14:58:13] jouncebot next [14:58:13] In 0 hour(s) and 1 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T1500) [14:58:21] anyone looking at that tin disk space alert? 
[14:59:30] /dev/mapper/tin--vg-root 40G 36G 2.1G 95% / [14:59:47] er [14:59:51] and now /dev/mapper/tin--vg-root 37G 33G 1.9G 95% / [15:00:04] anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T1500). [15:00:04] Jamesofur Dereckson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:14] Something is filling quickly a file. [15:02:31] Might be units [15:02:32] paravoid: green light to use Tin for deployment or should we use mira as fallback (stable at 1.9G, not decreasing anymore)? [15:02:39] Still shows as 2.1G for me [15:03:08] Reedy: I used df -h [15:03:38] Someone should clear out /tmp for starters [15:05:15] wtf are all these l10nupdate under /tmp? [15:05:23] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:06:18] _joe_, in case you didn't see .. i left a couple comments on https://gerrit.wikimedia.org/r/#/c/282904/ y'day. [15:06:34] 2.5G /tmp/make-wmf-branch [15:07:17] g5361tutu [15:07:30] what is that? [15:07:34] looks like a password [15:07:44] paravoid: make-wmf-branch can be removed. That was the new branch cut. [15:07:49] the aqs endpoint issue is related to https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase - basically we see timeouts in cassandra due to high latency reads (a combination of rotating disks and how sstables are stored). We are waiting for new nodes with SSDs.. [15:07:54] Or an uninspired username. [15:07:57] thcipriani: can you cleanup tin's /tmp? [15:08:08] thcipriani: including that :) [15:08:27] thcipriani: I'm asking you since you might notice things that need more permanent fixing [15:09:02] bd808: tin:~bd808/tmp/core is 1.5G [15:09:18] paravoid: sure, digging. I'm trying to figure out the l10nupdate thing now. 
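The /tmp triage above (df -h showing the root volume at 95%, then someone spotting 2.5G under /tmp/make-wmf-branch) comes down to ranking the direct children of a directory by size. A minimal sketch of that, assuming nothing about tin's actual tooling; the function names `dir_size` and `largest_entries` are made up for illustration:

```python
import os


def dir_size(path):
    """Total size in bytes of the files under path (symlinks are not followed)."""
    total = 0
    # onerror swallows unreadable directories instead of aborting the walk.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_entries(parent, limit=10):
    """Return (size, path) pairs for the biggest direct children of parent."""
    sizes = []
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            size = dir_size(entry.path)
        else:
            try:
                size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
        sizes.append((size, entry.path))
    # Largest first, so the worst offenders are at the top.
    return sorted(sizes, reverse=True)[:limit]
```

Calling `largest_entries("/tmp")` and printing each pair would give roughly what `du` on each child, sorted by size, reports. The byte counts are apparent file sizes, so they can differ from the block usage that df shows.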
[15:09:44] sudo -u l10nupdate rm -rf TOO_MANY_ARGS [15:09:55] * gehel oops... [15:10:05] and twentyafterfour: tin:~twentyafterfour is 2.7G [15:10:08] time to change password... [15:10:15] Reedy: | xargs -nN is your friend [15:10:18] yeah [15:10:33] paravoid: could you delete /tmp/make-wmf-branch? It's a known, I don't have permissions. [15:11:08] paravoid: cleaned up. looks like I was doing something with the 1.27.0-wmf.13 branch and left cruft around [15:12:22] thcipriani: done [15:12:38] We're now at 6.4G. [15:13:00] _joe_: can you explain to me the difference between hieradata/role/eqiad/swift/storage.yaml and hieradata/role/common/swift/eqiad_prod/storage.yaml? I'm unfamiliar with whatever it is that resolves the latter path [15:13:12] csteipp, RoanKattouw: your home directories on tin are also 1.9G/1.1G respectively (it's okay if they're useful things, but please cleanup if it's cruft or > 90-day-old PII) [15:13:32] <_joe_> the first defines data for eqiad nodes only [15:13:35] <_joe_> the second is global [15:13:37] Yeah, I can clean most of that up.. [15:13:58] <_joe_> the first gets looked up first, so if a match is found, it will override the default [15:13:59] !log uploaded librsvg 2.40.5-1+deb8u1+wmf1 to carbon (T132584) [15:14:00] T132584: librsvg path patch needs to be applied for jessie - https://phabricator.wikimedia.org/T132584 [15:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:08] <_joe_> wait, the two paths are different [15:14:47] <_joe_> andrewbogott: ok so, I /guess/ it's a remainder from the swift/swift_new transition [15:14:52] _joe_: it's the 'eqiad_prod' part that I don't understand. 
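The lookup order _joe_ describes above (the site-specific file is consulted first, and a match there overrides the global default) can be sketched as a first-match search over ordered layers. This is only an illustration of the idea, not hiera's implementation; the layer labels, keys and values below are hypothetical:

```python
def lookup(key, layers, default=None):
    """Return (value, layer_name) from the first layer that defines key.

    layers is an ordered list of (name, dict) pairs, most specific first.
    """
    for name, data in layers:
        if key in data:
            return data[key], name
    return default, None


# Hypothetical data: a site-specific layer followed by a global one.
layers = [
    ("role/eqiad/swift/storage", {"replication_factor": 3}),
    ("role/common/swift/storage", {"replication_factor": 2, "port": 6000}),
]

# Defined in both layers: the more specific one wins.
print(lookup("replication_factor", layers))
# Defined only in the global layer: falls through to the default.
print(lookup("port", layers))
```

A file that no role resolves to never appears in `layers` at all, which is why a leftover directory such as the 'eqiad_prod' one can sit unused without changing any result.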
[15:14:56] <_joe_> I have no idea if that was ever completed [15:14:57] no, it's not [15:14:59] yes, it was [15:15:02] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [15:15:06] it's just our hiera setup being nuts :) [15:15:09] Evidence suggests that that's dead code — think I can just rip it out? [15:15:16] Okay, so with 6.6 Gb free, we're good and can start the SWAT now I guess. [15:15:18] <_joe_> paravoid: I think it's just dead code [15:15:27] <_joe_> andrewbogott: I agree, let me take a look 1 second [15:15:42] <_joe_> andrewbogott: we used to have role::swift::eqiad_prod::* [15:15:43] thx [15:16:06] Jamesofur: ping? [15:16:09] oh! it makes sense if that was just a literal role name [15:16:14] <_joe_> yes [15:16:16] Dereckson: yup here [15:16:17] thcipriani: If the "l10nupdate thing" is what I think it is, my prior guess was that there was a un-puppeted cron on the old tin that pruned the nightly build temp files [15:16:32] the swift manifests are a prime example of how crazy our hiera setup is [15:17:00] people have heard me ranting about this before [15:17:17] just find hieradata | grep swift for the full glory :) [15:17:46] paravoid: done [15:17:47] <_joe_> paravoid: half of it is dead code, yuck [15:17:51] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283400 (https://phabricator.wikimedia.org/T132667) (owner: 10Jalexander) [15:17:51] _joe_: nope [15:18:05] bd808: that would make sense. I don't recognize the file names. The contents are just large serialized arrays. [15:18:16] (03Merged) 10jenkins-bot: Change voteWiki to fa language temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283400 (https://phabricator.wikimedia.org/T132667) (owner: 10Jalexander) [15:18:48] <_joe_> andrewbogott: the esams_prod directory is dead code as well [15:18:49] Jamesofur: Temporary hacks... 
[15:19:28] <_joe_> andrewbogott: verify it, but I'm pretty much convinced it is [15:19:29] mlitn: do you need these aftv5 files in your ~? [15:19:35] mlitn: (on tin) [15:19:44] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2207113 (10Andrew) @dzahn, I'm talking to _joe_ about this right now. You've stumbled into a puddle of dead code, I will try to straighten this out. [15:19:49] I have no idea what they are, let me check [15:19:51] (probably not) [15:20:07] Reedy: hmmm, not so much temp hack as its a voting wiki and the next election is in a non English language. It only makes sense to make the default that language because there are so many reasons it could fall back and out of the secure poll process. I blame media wiki :) [15:20:39] Reedy: I've an error during sync-apaches step: [15:20:41] error: [Errno 111] Connection refused [15:20:44] I resync? [15:20:53] So I just consider it "getting ready for its next use case" [15:20:53] Or should I ask mutante to rearm? [15:21:04] Dereckson: which host? [15:21:12] ^ [15:21:16] bd808: from tin [15:21:22] paravoid: I didn’t; just removed them [15:21:23] target host not in log [15:21:30] hmm [15:21:33] (03PS1) 10Andrew Bogott: Hiera: Rip out some dead swift rules [puppet] - 10https://gerrit.wikimedia.org/r/283454 [15:21:38] run again with verbosity? [15:22:01] (03CR) 10Andrew Bogott: "let me just double-check with the puppet compiler..." [puppet] - 10https://gerrit.wikimedia.org/r/283454 (owner: 10Andrew Bogott) [15:22:09] bd808: https://etherpad.wikimedia.org/p/SWAT_2016-04-14 [15:23:01] Dereckson: try again with -v [15:23:05] it'll spew a lot [15:23:05] yup, doing [15:23:14] sync-common: 100% (ok: 429; fail: 0; left: 0) [15:23:17] That suggests it happened [15:23:20] Oh.. [15:23:25] bd808: Are we still notifying profiler stuff? 
[15:23:31] File "/usr/lib/python2.7/dist-packages/scap/log.py", line 87, in emit [15:23:42] Where is logmsgbot? [15:23:47] that error is the !log message failing [15:23:59] File "/usr/lib/python2.7/dist-packages/scap/log.py", line 87, in emit [15:24:22] So tin isn't connecting to neon? [15:24:31] (03CR) 10Giuseppe Lavagetto: [C: 031] "It all looks like dead code to me; verify it with the compiler nonetheless." [puppet] - 10https://gerrit.wikimedia.org/r/283454 (owner: 10Andrew Bogott) [15:24:51] <_joe_> or maybe the bot is dead? [15:24:59] quite possibly [15:25:05] Jamesofur: seems to work, but homepage should be moved to https://vote.wikimedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C or you should create a redirect from there to Main Page [15:25:09] [16:23:42] Where is logmsgbot? [15:25:10] ;) [15:25:25] Dereckson: thanks, we will do [15:25:27] Reedy: you offer we create a task to make vote. multilingual? [15:25:40] Dereckson: sorry? [15:26:18] $ telnet neon.wikimedia.org 9200 -- telnet: Unable to connect to remote host: Connection refused [15:26:18] You were complaining about a temporary hack to switch vote.wikimedia language to Persan. Should we create a task to set the wiki more like Commons/meta/Wikidata? [15:26:29] <_joe_> bd808: it's just waiting for reading from freenode [15:27:00]  [15:27:02] _joe_: do you know, does the puppet compiler mode where you don't specify nodes work properly these days? [15:27:03] Testing dologmsg from tin [15:27:03] <_joe_> I just restarted it :) [15:27:15] * andrewbogott tries it [15:27:17] it's working now [15:27:18] !log Synchronized wmf-config/InitialiseSettings.php: Change voteWiki to fa language temporarily (T132667) (duration: 00m 34s) [15:27:19] T132667: Setup VoteWiki for faWP use - https://phabricator.wikimedia.org/T132667 [15:27:20] <_joe_> andrewbogott: it does, but why would you want to do that? 
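The telnet test above (neon.wikimedia.org port 9200 answering "Connection refused") is a plain TCP reachability check. A minimal sketch of the same check, with `port_is_open` as a made-up helper name; it only tells you whether something accepts connections on the port, not whether the bot behind it is healthy:

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused (errno 111 on Linux), timeouts and DNS failures.
        return False
```

For the case in the log this would be `port_is_open("neon.wikimedia.org", 9200)`, which would have returned False while the listener was down.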
[15:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:31] andrewbogott: works [15:27:38] _joe_: If I'm testing for dead code, it seems like the only way to be sure [15:27:43] <_joe_> no [15:27:53] <_joe_> the only way is test the swift machines [15:28:01] <_joe_> with a regex host specification [15:28:12] <_joe_> like [15:28:18] Well, ok, but in general… If I think code is dead then of course I don't know what nodes to test on since I believe the answer to be "no nodes at all" [15:28:20] (03PS2) 10Dereckson: Add bio.acousti.ca to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283414 (https://phabricator.wikimedia.org/T132140) [15:28:27] <_joe_> 're:ms-[bf]e.*' [15:28:28] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283414 (https://phabricator.wikimedia.org/T132140) (owner: 10Dereckson) [15:28:32] <_joe_> andrewbogott: ^^ [15:28:56] (03Merged) 10jenkins-bot: Add bio.acousti.ca to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283414 (https://phabricator.wikimedia.org/T132140) (owner: 10Dereckson) [15:29:50] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add bio.acousti.ca to wgCopyUploadsDomains (T132140) (duration: 00m 27s) [15:29:50] T132140: Please add bio.acousti.ca to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T132140 [15:30:15] Testing. [15:30:21] Works. [15:30:47] Dereckson: This is an interesting case of self review :P. 
But you got my +1 [15:30:50] posthumous [15:30:50] :P [15:31:05] (03PS3) 10Dereckson: Adding museumcommons.wikimedia.nl on $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281700 (https://phabricator.wikimedia.org/T131841) (owner: 10MarcoAurelio) [15:31:09] gah, I hate it, if my client submits a comment if I'm posting something [15:31:12] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281700 (https://phabricator.wikimedia.org/T131841) (owner: 10MarcoAurelio) [15:31:35] Luke081515: if you wish to review all SWAT patches, they're at https://etherpad.wikimedia.org/p/SWAT_2016-04-14 [15:31:41] (03Merged) 10jenkins-bot: Adding museumcommons.wikimedia.nl on $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281700 (https://phabricator.wikimedia.org/T131841) (owner: 10MarcoAurelio) [15:31:46] (03PS1) 10Giuseppe Lavagetto: Revert "confd: removing redundant hiera data" [puppet] - 10https://gerrit.wikimedia.org/r/283458 [15:31:55] (03PS2) 10Giuseppe Lavagetto: Revert "confd: removing redundant hiera data" [puppet] - 10https://gerrit.wikimedia.org/r/283458 [15:31:59] <_joe_> bblack: ^^ [15:32:00] Dereckson: No, no problem, I trust you in things like that easy domains :D [15:32:13] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:19] <_joe_> I forgot about machines with wikimedia.org addresses in eqiad... [15:32:44] heh [15:32:55] (03PS1) 10Ema: Workaround for mdadm boot-time race condition [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) [15:32:58] that probably includes the active LVSes in eqiad? 
[15:33:04] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add museumcommons.wikimedia.nl to wgCopyUploadsDomains (T131841) (duration: 00m 26s) [15:33:05] T131841: Please add museumcommons.wikimedia.nl to GW toolset whitelist - https://phabricator.wikimedia.org/T131841 [15:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:15] Testing. [15:33:24] PROBLEM - HHVM rendering on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:29] paravoid: I'm sure I had a good reason to put a clone of MW core in my home dir in May 2013, but I don't any more. Deleted now [15:33:40] RoanKattouw: :P [15:33:42] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "breaks cp1008, this is not the correct fix, but it is indeed a fix" [puppet] - 10https://gerrit.wikimedia.org/r/283458 (owner: 10Giuseppe Lavagetto) [15:33:49] Works. [15:34:18] 12G free now, not bad :) [15:34:52] (03PS3) 10Dereckson: New logo for vec.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283435 (https://phabricator.wikimedia.org/T132157) [15:34:59] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283435 (https://phabricator.wikimedia.org/T132157) (owner: 10Dereckson) [15:35:34] (03Merged) 10jenkins-bot: New logo for vec.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283435 (https://phabricator.wikimedia.org/T132157) (owner: 10Dereckson) [15:35:46] And I don't know what ~/.hhvm.hhbc was, but it was 71MB so that's now gone too [15:35:58] I was confused that my dotfiles were so big [15:37:18] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add task identifier comment to vec.wikisource logo change (no-op, T132157) (duration: 00m 27s) [15:37:19] T132157: Change vec.wikisource project logo - https://phabricator.wikimedia.org/T132157 [15:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:07] !log dereckson@tin 
Synchronized static/images/project-logos/vecwikisource.png: New logo for vec.wikisource (T132157) (duration: 00m 26s) [15:38:08] T132157: Change vec.wikisource project logo - https://phabricator.wikimedia.org/T132157 [15:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:42] RECOVERY - confd service on cp1008 is OK: OK - confd is active [15:38:49] Works. [15:39:18] (03PS2) 10Dereckson: HD logo for lad.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282305 (https://phabricator.wikimedia.org/T132120) [15:39:27] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282305 (https://phabricator.wikimedia.org/T132120) (owner: 10Dereckson) [15:39:53] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.040 second response time [15:40:09] (03Merged) 10jenkins-bot: HD logo for lad.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282305 (https://phabricator.wikimedia.org/T132120) (owner: 10Dereckson) [15:41:04] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 66771 bytes in 0.094 second response time [15:41:47] !log dereckson@tin Synchronized static/images/project-logos/ladwiki-1.5x.png: HD logo for lad.wikipedia (T132120, 1/3) (duration: 00m 25s) [15:41:48] T132120: HD version of lad.wikipedia.org logo - https://phabricator.wikimedia.org/T132120 [15:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:15] RoanKattouw: ~/.hhvm.hhbc would be your per-user HHVM bytecode cache [15:42:23] !log dereckson@tin Synchronized static/images/project-logos/ladwiki-2x.png: HD logo for lad.wikipedia (T132120, 2/3) (duration: 00m 25s) [15:42:24] T132120: HD version of lad.wikipedia.org logo - https://phabricator.wikimedia.org/T132120 [15:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:55] !log dereckson@tin Synchronized 
wmf-config/InitialiseSettings.php: HD logo for lad.wikipedia (T132120, 3/3) (duration: 00m 26s) [15:42:56] T132120: HD version of lad.wikipedia.org logo - https://phabricator.wikimedia.org/T132120 [15:42:57] Someone with an iPad or an iPhone for a retina test? [15:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:32] (03PS4) 10Ladsgroup: [WIP] ores: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [15:44:46] Code for HD logo is in CSS. Logos are reachable at the /static URL, so looks good to me. [15:45:32] (03CR) 10Ladsgroup: [C: 04-1] "Very WIP, Needs lots of work, It's necessary to hiera-fy everything here" [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [15:45:34] (03CR) 10jenkins-bot: [V: 04-1] [WIP] ores: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [15:45:36] (03CR) 10BBlack: "seems to be for all hosts, but I don't know if this is valid outside jessie?" [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) (owner: 10Ema) [15:45:40] Go for Kartographer. [15:45:55] (03PS2) 10Dereckson: Set wgKartographerWikivoyageMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 [15:46:03] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [15:46:54] (03Merged) 10jenkins-bot: Set wgKartographerWikivoyageMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [15:48:50] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set wgKartographerWikivoyageMode (duration: 00m 26s) [15:48:52] Testing.
[15:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:43] (03PS1) 10Filippo Giunchedi: logstash: transform nulls as zero for graphite alert [puppet] - 10https://gerrit.wikimedia.org/r/283461 [15:49:50] Ok, works. [15:50:05] (03PS3) 10Dereckson: Enable Kartographer on pl.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283340 (https://phabricator.wikimedia.org/T132510) [15:50:15] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283340 (https://phabricator.wikimedia.org/T132510) (owner: 10Dereckson) [15:50:42] (03Merged) 10jenkins-bot: Enable Kartographer on pl.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283340 (https://phabricator.wikimedia.org/T132510) (owner: 10Dereckson) [15:51:48] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Kartographer on pl.wikimedia (T132510) (duration: 00m 26s) [15:51:49] T132510: Enable Kartographer on WMPL wiki - https://phabricator.wikimedia.org/T132510 [15:51:50] Testing. [15:51:51] (03PS2) 10Filippo Giunchedi: logstash: transform nulls as zero for graphite alert [puppet] - 10https://gerrit.wikimedia.org/r/283461 [15:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:19] Works. 
[15:53:01] (03PS2) 10Dereckson: Babel configuration for uz.wikipedia (part 1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283425 (https://phabricator.wikimedia.org/T131924) [15:53:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] logstash: transform nulls as zero for graphite alert [puppet] - 10https://gerrit.wikimedia.org/r/283461 (owner: 10Filippo Giunchedi) [15:53:09] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283425 (https://phabricator.wikimedia.org/T131924) (owner: 10Dereckson) [15:53:29] ottomata, elukey: aqfinish=17909.5min speed=1638K/sec [15:53:38] (03Merged) 10jenkins-bot: Babel configuration for uz.wikipedia (part 1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283425 (https://phabricator.wikimedia.org/T131924) (owner: 10Dereckson) [15:53:46] Testing. [15:53:53] ottomata, elukey: that's aqs1001, you probably want to do something about that [15:53:56] Syncing before. [15:54:21] ottomata, elukey: it's possible that it's just the H310's crappiness, but that's just a guess, you should definitely investigate :) [15:54:57] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Babel configuration for uz.wikipedia (part 1/2) (T131924) (duration: 00m 26s) [15:54:58] T131924: Babel configuration for uz.wikipedia - https://phabricator.wikimedia.org/T131924 [15:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:23] Testing. [15:56:13] mafk: yeah, so I've enabled for them their root cat, as it's what seems more important for them: that gives User fr / User en / User de / User nl. [15:56:22] Works. [15:56:54] ktnx [15:57:00] paravoid: sorry can you give me more context about aqfinish=17909.5min speed=1638K/sec ? 
I was investigating a bot hitting the pageview api earlier on [15:57:10] (sorry I might have missed some conversation in the channel) [15:57:32] elukey: cat /proc/mdstat [15:57:40] ahhhhh [15:57:46] now it makes sense [15:57:46] mafk: do you think we should see for a dev change to allow two cats? (I'm not really in favour of that) [15:58:05] ottomata added a new disk yesterday to the RAID [15:58:16] I know [15:58:30] it's going to take days to finish at that rate, which is not normal [15:58:33] (03PS2) 10Dereckson: Enable GuidedTour extension on sq.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283450 (https://phabricator.wikimedia.org/T132412) [15:58:42] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283450 (https://phabricator.wikimedia.org/T132412) (owner: 10Dereckson) [15:59:06] paravoid: yep yep [15:59:10] (03Merged) 10jenkins-bot: Enable GuidedTour extension on sq.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283450 (https://phabricator.wikimedia.org/T132412) (owner: 10Dereckson) [15:59:10] thanks for the heads up [16:00:04] godog moritzm: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T1600). Please do the needful. [16:00:04] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:04] godog: Dear anthropoid, the time has come. Please deploy codfw switchover for swift (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T1600). [16:00:04] godog: A patch you scheduled for codfw switchover for swift is about to be deployed. Please be available during the process. 
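[editor's note] The "finish=17909.5min speed=1638K/sec" figures quoted above are the kind of values a RAID rebuild progress line in /proc/mdstat reports. As a rough illustration of why that reads as "days", here is a minimal sketch that pulls the two numbers out of such a fragment and converts the estimate to days. The helper name and the returned keys are hypothetical and not part of any tool mentioned in the channel; the regex only assumes the "finish=…min speed=…K/sec" shape seen in the quoted text.

```python
import re


def parse_mdstat_progress(text):
    """Extract the finish estimate and speed from a 'finish=...min speed=...K/sec' fragment.

    Hypothetical helper for illustration only; returns None when the fragment is absent.
    """
    match = re.search(r"finish=([\d.]+)min\s+speed=(\d+)K/sec", text)
    if match is None:
        return None
    minutes = float(match.group(1))
    return {
        "finish_minutes": minutes,
        "finish_days": minutes / (60 * 24),  # minutes -> days
        "speed_k_per_sec": int(match.group(2)),
    }


# The figures quoted in the channel.
progress = parse_mdstat_progress("finish=17909.5min speed=1638K/sec")
print(f"{progress['finish_days']:.1f} days at {progress['speed_k_per_sec']}K/sec")
```

With the quoted figures this works out to a bit over twelve days, which matches the "it's going to take days to finish at that rate" remark.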
[16:00:11] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable GuidedTour extension on sq.wikipedia (T132412) (duration: 00m 26s) [16:00:12] T132412: Enable GuidedTour extension on sq.wikipedia - https://phabricator.wikimedia.org/T132412 [16:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:27] hey [16:01:09] Testing. [16:01:28] Hi Krenair. [16:01:40] Krenair: hey, I'll be your puppet SWAT host [16:02:04] godog, so this change should be labs-only applying only on shinken [16:02:40] Seems to work, it's in Special:Version, but I can't check on https://sq.wikipedia.org/wiki/Wikipedia:TWA/1/Start if it uses or not guided tour features. [16:02:41] Krenair: yup, easy enough, merging [16:02:47] thanks [16:02:48] (03PS3) 10Filippo Giunchedi: shinken: Remove old beta bits check [puppet] - 10https://gerrit.wikimedia.org/r/282498 (owner: 10Alex Monk) [16:02:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] shinken: Remove old beta bits check [puppet] - 10https://gerrit.wikimedia.org/r/282498 (owner: 10Alex Monk) [16:02:58] So that's end the SWAT. [16:03:14] Krenair: {{done}} [16:03:31] Dereckson: I don't think it's necessary right now. [16:03:59] that is slow! [16:04:00] :/ [16:04:33] paravoid bblack AaronSchulz deploying the swift changes [16:04:33] <_joe_> why didn't jouncebot ping me? [16:04:38] <_joe_> meh I forgot [16:05:04] _joe_: puppet swat listed me and moritz btw [16:05:10] <_joe_> yeah my fault [16:05:27] <_joe_> godog: we should list a bunch of people and not change it every week [16:05:40] <_joe_> anyways, go on with the switchover [16:05:47] <_joe_> I'm here if you need assistance [16:05:51] thanks! 
[16:06:35] (03PS2) 10Filippo Giunchedi: Set synchronous swift writes for eqiad/codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282888 (https://phabricator.wikimedia.org/T129089) [16:06:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Set synchronous swift writes for eqiad/codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282888 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [16:08:07] !log filippo@tin Synchronized wmf-config/filebackend-production.php: swift codfw sync replication T129089 (duration: 00m 26s) [16:08:08] T129089: switch upload varnish backends to codfw ahead of full switch - https://phabricator.wikimedia.org/T129089 [16:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:07] (03PS2) 10Andrew Bogott: Hiera: Rip out some dead swift rules [puppet] - 10https://gerrit.wikimedia.org/r/283454 [16:09:19] (03CR) 10Andrew Bogott: "puppet compiler approves" [puppet] - 10https://gerrit.wikimedia.org/r/283454 (owner: 10Andrew Bogott) [16:09:46] (03PS1) 10Gehel: Import elasticsearch 2.x into our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) [16:10:53] (03CR) 10Andrew Bogott: [C: 032] Hiera: Rip out some dead swift rules [puppet] - 10https://gerrit.wikimedia.org/r/283454 (owner: 10Andrew Bogott) [16:12:15] (03PS2) 10Filippo Giunchedi: varnish: route upload backends to codfw [puppet] - 10https://gerrit.wikimedia.org/r/282890 (https://phabricator.wikimedia.org/T129089) [16:12:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: route upload backends to codfw [puppet] - 10https://gerrit.wikimedia.org/r/282890 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [16:13:11] (03CR) 10Ema: "I'm not sure the problem is jessie-specific, although we have noticed it only on jessie systems AFAIK. 
Perhaps we should limit the workaro" [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) (owner: 10Ema) [16:13:39] !log route upload backends to codfw - T129089 [16:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:42] T129089: switch upload varnish backends to codfw ahead of full switch - https://phabricator.wikimedia.org/T129089 [16:14:44] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2207372 (10Nuria) [16:15:13] 06Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2207374 (10faidon) OK, I debugged this some more, and this was caused by a Juniper bug and one that I vaguely recall experiencing before. The issue was th... [16:15:47] <_joe_> godog: are you forcing a puppet run on the cache hosts? [16:15:49] ottomata, robh ^^^ [16:15:56] installer is running now [16:16:43] <_joe_> network is switching, apparently [16:17:12] (03PS1) 10Andrew Bogott: Add swift-roots to swift storage and proxy boxes [puppet] - 10https://gerrit.wikimedia.org/r/283471 (https://phabricator.wikimedia.org/T130910) [16:17:36] nice!!!! [16:17:45] paravoid: how'd you do it? [16:17:49] _joe_: I am yeah, I'll let it simmer for some minutes [16:17:59] ottomata: see the task :) [16:18:20] <_joe_> andrewbogott: hold on with the changes [16:18:28] <_joe_> we're switching over swift [16:18:32] _joe_: ok [16:18:40] umm...why am I getting 401 on all Commons thumbnails? [16:18:44] <_joe_> in general, let's not merge puppet changes until a switchover is done [16:18:51] 401 Unauthorized [16:18:51] This server could not verify that you are authorized to access the document you requested. Either you supplied the wrong credentials (e.g., bad password), or your browser does not understand how to supply the credentials required.
[16:18:51] Token may have timed out [16:18:54] <_joe_> Josve05a: uh? an example? [16:18:59] https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/Luzi%C3%A2nia-DF.jpg/120px-Luzi%C3%A2nia-DF.jpg [16:19:00] <_joe_> godog: ^^ [16:19:19] checking [16:19:20] <_joe_> godog: let's rollback and figure out what's happening? [16:19:39] yes [16:19:48] I'll rollback the thumbs only [16:20:08] <_joe_> Josve05a: all thumbnails or just new ones? [16:20:24] <_joe_> only new ones [16:20:26] (03CR) 10Andrew Bogott: "Tested with the puppet compiler on ms-be1002 and it had the desired effect. Waiting to merge until after the codfw swift switchover is co" [puppet] - 10https://gerrit.wikimedia.org/r/283471 (https://phabricator.wikimedia.org/T130910) (owner: 10Andrew Bogott) [16:20:34] manybubbles: https://gerrit.wikimedia.org/r/#/c/139819/ ← Does your comment about the two indexes issue still apply? [16:20:44] all thumbnails. Not in gallery mode or file preview at the top, but in File history and when in "Bacth tool" [16:20:45] <_joe_> well, any non cached thumb [16:20:46] batch* [16:21:00] Dereckson: manybubbles isn't around much... 
[16:21:00] <_joe_> Josve05a: yeah, we're rolling back a change now [16:21:13] thanks paravoid [16:21:23] (03PS1) 10Filippo Giunchedi: varnish: move thumbs back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283472 [16:21:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: move thumbs back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283472 (owner: 10Filippo Giunchedi) [16:22:11] <_joe_> Josve05a: ^^ in a few minutes those should be back to normal [16:22:18] ok, thanks :) [16:22:24] !log rollback varnish backends to eqiad for thumbs [16:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:05] <_joe_> godog: ack when the puppet runs are done [16:23:07] (03PS3) 10EBernhardson: Convert mwgrep to use regexp by default [puppet] - 10https://gerrit.wikimedia.org/r/283107 [16:23:11] I'll investigate meanwhile, thanks Josve05a for the report [16:23:26] <_joe_> I need to verify we didn't cache the bad values [16:23:29] just happy it was caught pretty quick [16:23:56] ok, now they are coming back [16:24:03] <_joe_> Josve05a: cool [16:24:30] <_joe_> and yes, luckily we don't cache 401s [16:24:57] (03CR) 10jenkins-bot: [V: 04-1] Convert mwgrep to use regexp by default [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson) [16:25:01] <_joe_> godog: seems a credentials mismatch of some sorts? [16:26:10] _joe_: sort of, thumb containers lacking permissions in codfw, fixing ATM [16:28:04] paravoid: awesome, thank you! [16:28:38] (03PS4) 10EBernhardson: Convert mwgrep to use regexp by default [puppet] - 10https://gerrit.wikimedia.org/r/283107 [16:28:58] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2207393 (10NahidSultan) Another one: https://upload.wikimedia.org/wikipedia/commons/7/7f/Sajid-Monkey-Bizness.w... [16:30:12] so what's the thumbs 401 thing? 
[16:30:29] <_joe_> "thumb containers lacking permissions in codfw" [16:30:42] PROBLEM - Host lead is DOWN: PING CRITICAL - Packet loss = 100% [16:30:48] <_joe_> wat? [16:31:27] <_joe_> is someone taking lead down? [16:31:34] not me! [16:32:04] 06Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2207399 (10faidon) The network config was reverted and it still works, so it should be final now. However, the installer failed at the last step with: ```... [16:32:05] apparently it's a future host for gerrit, but not yet, just "include standard" [16:32:30] <_joe_> it's up btw [16:32:41] <_joe_> console show a jessie prompt [16:32:53] well faidon just reverted something on a network switch [16:33:01] maybe some port labeling screwup, etc? [16:33:03] that's not related [16:33:38] unless the switch is _that_ buggy :) [16:33:39] thangs are breaking down arend y'all :/ xD [16:33:42] around* [16:33:53] (and I can't spell for shit...) [16:33:58] (03PS2) 10Andrew Bogott: Add swift-roots to swift storage and proxy boxes [puppet] - 10https://gerrit.wikimedia.org/r/283471 (https://phabricator.wikimedia.org/T130910) [16:34:00] (03PS1) 10Andrew Bogott: Prune another dead hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/283476 [16:34:33] well [16:34:43] the switch was misconfigured for lead to begin with [16:34:52] it does not belong to any interface ranges [16:35:00] by commit to an unrelated port probably reset that [16:35:28] gotta love these juniper bugs [16:35:41] nice [16:36:07] adding, let's see [16:36:22] yup [16:36:22] it's back [16:36:26] awesome [16:36:29] RECOVERY - Host lead is UP: PING OK - Packet loss = 0%, RTA = 3.91 ms [16:36:49] I literally configured and then unconfigured an entirely different port (an1003's) to workaround a juniper bug [16:37:17] shouldn't all our in-use ports be in some range for vlan assignment through? 
[16:37:20] *though [16:40:27] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2207408 (10Milimetric) @BBlack: just in case there's some concern about what the purpose of analytics.wikimedia.org is. We will never use it to proxy to services / dashboards on labs. W... [16:42:16] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2207412 (10Dereckson) There are 404 now, it's probably the time for the cache to expire. [16:45:38] thanks all for the quick thumb issue fix btw [16:46:21] 06Operations, 06Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2207420 (10ori) >>! In T129963#2179841, @Joe wrote: > I don't think measuring latencies for memcached (where they are usually around 1 ms) is that significant; improving the cache hi... [16:47:08] 06Operations, 06Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2207427 (10ori) @elukey, is this still on your radar? [16:47:31] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2207429 (10BBlack) @Milimetric - I guess what I'm missing here is the disconnect between our public termination of analytics.wikimedia.org (on, say, cache_misc) and what "a subfolder on a... [16:49:48] !log run on tin: mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php commonswiki --backend=local-multiwrite [16:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:06] (03CR) 10Ori.livneh: "{citation needed}" [puppet] - 10https://gerrit.wikimedia.org/r/282356 (https://phabricator.wikimedia.org/T126447) (owner: 10Faidon Liambotis) [16:50:55] ori: come on [16:51:27] what? 
did you investigate it? i remember you were starting to, but i was not privvy to your findings [16:52:23] if it's even close, it's a win on not maintaining local C code we don't have to [16:53:00] ori: re: memcached - sure I it but IIRC you were going to add some thoughts and things to do before proceeding, so I was waiting (but I may have got it wrong) [16:53:25] (03CR) 10Andrew Bogott: "compiler confirms that this is a no-op" [puppet] - 10https://gerrit.wikimedia.org/r/283476 (owner: 10Andrew Bogott) [16:53:36] (03PS1) 10Faidon Liambotis: ganeti: bump check_procs for noded to 1:2 [puppet] - 10https://gerrit.wikimedia.org/r/283480 [16:54:03] (03CR) 10Faidon Liambotis: [C: 032 V: 032] ganeti: bump check_procs for noded to 1:2 [puppet] - 10https://gerrit.wikimedia.org/r/283480 (owner: 10Faidon Liambotis) [16:54:14] elukey: maybe information about slab config options? there is no real opportunity there, in the end, but i can update the task to substantiate that [16:54:39] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2207462 (10BBlack) (for services small enough to not need a cluster of their own hardware, I think we do have solutions where we virtualize smaller services on ganeti, too. The above is... [16:54:46] ori: all right I'll make a plan next week then :) [16:55:33] bblack: maybe, but putting in a bit of work to measure the difference is not time wasted, imo [16:55:48] Hi godog. I see you're playing with local-multiwrite backend, would you know what to do when an upload fails, then we got this message retrying to upload it again: The file "mwstore://local-multiwrite/local-public/0/09/Президент_России_—_2016-03-11_—_Единый_день_приёмки_военной_продукции.webm" [16:55:54] is in an inconsistent state within the internal storage backends [16:56:13] (03PS1) 10Jdrewniak: Updating portals to master. 
removing top-links A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283481 (https://phabricator.wikimedia.org/T124116) [16:56:33] paravoid: what did you mean by "come on"? IIRC you agreed that some testing was in order. I am not calling your testing inadequate, I'm just saying I'm not privy to your findings. [16:57:00] 06Operations, 10ops-eqiad, 10DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2207486 (10Cmjohnson) @volans Hot spots do not appear to exist for these 3 servers. The cool air intake is the same as neighboring servers 1065 intake temp is 24°C and 1070/71 is 25°... [16:58:06] Dereckson: sec, I think that's T128096 [16:58:07] T128096: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096 [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T1700). [17:02:54] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2207518 (10NahidSultan) >>! In T109331#2207412, @Dereckson wrote: > There are 404 now, it's probably the time f... [17:03:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:05:03] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2207527 (10Dereckson) CTRL + SHIFT + R should do the trick in your browser. It's not possible to invalidate a c... 
[17:11:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:13:39] PROBLEM - statsdlb process on graphite1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [17:14:14] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2207564 (10NahidSultan) >>! In T109331#2207527, @Dereckson wrote: > It's not possible to invalidate a cached fi... [17:14:38] !log set zone access for all - private wikis in codfw [17:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:30] RECOVERY - statsdlb process on graphite1001 is OK: PROCS OK: 1 process with command name statsdlb [17:18:22] <_joe_> godog: should we try again? [17:18:38] _joe_: yeah as soon as it is finished [17:20:22] 06Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2207584 (10matmarex) Hmm, okay, looks like I was mostly wrong then. I think there's a bug in mw.ForeignApi JS code too,... 
[17:26:26] (03PS19) 10Ottomata: Kafka config: Add config functions [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [17:30:27] (03CR) 10Ottomata: [C: 032] Kafka config: Add config functions [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [17:31:02] (03PS1) 10Gehel: Revert "remove wdqs1002 from varnish during reinstall / fix" [puppet] - 10https://gerrit.wikimedia.org/r/283485 (https://phabricator.wikimedia.org/T132387) [17:33:09] (03PS5) 10Ladsgroup: [WIP] ores: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [17:34:20] (03PS1) 10Cmjohnson: Adding graphite1003 mac to dchpd file [puppet] - 10https://gerrit.wikimedia.org/r/283486 [17:34:56] (03PS1) 10Ottomata: Update varnishkafka module with should_subscribe fix [puppet] - 10https://gerrit.wikimedia.org/r/283488 [17:35:24] (03CR) 10Ottomata: [C: 032 V: 032] Update varnishkafka module with should_subscribe fix [puppet] - 10https://gerrit.wikimedia.org/r/283488 (owner: 10Ottomata) [17:35:48] (03PS2) 10Cmjohnson: Adding graphite1003 mac to dchpd file [puppet] - 10https://gerrit.wikimedia.org/r/283486 [17:37:01] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2207677 (10matmarex) [17:37:03] 06Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2207676 (10matmarex) [17:37:32] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2204871 (10matmarex) [17:39:33] (03PS1) 10Ottomata: Set 
$kafka_brokers in role::cache::kafka [puppet] - 10https://gerrit.wikimedia.org/r/283489 [17:39:35] (03CR) 10Cmjohnson: [C: 032] Adding graphite1003 mac to dchpd file [puppet] - 10https://gerrit.wikimedia.org/r/283486 (owner: 10Cmjohnson) [17:40:29] ah cmjohnson sorry [17:40:31] ottomata: somewhere in the mix of all your changes is my small change [17:40:32] i have some unmerged changes [17:40:39] no worries...just merge when you get a chance [17:40:45] yeah, i'm about to merge in a sec, as i was reviewing the puppet-merge i noticed something weird [17:40:50] am fixing before i fully merge [17:40:52] i gotcha [17:40:57] cool...no hurry [17:43:10] (03CR) 10Ladsgroup: "Untested, let me check it in depth and then we can +2 :)" [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [17:43:32] ottomata: kafka config applied? [17:43:34] all good? [17:43:42] we're safe? [17:43:59] mobrovac: i noticed the role::kafka::cache change you made [17:44:03] i didn't run on a cache host [17:44:20] (03CR) 10Smalyshev: [C: 031] "yay!"
[puppet] - 10https://gerrit.wikimedia.org/r/283485 (https://phabricator.wikimedia.org/T132387) (owner: 10Gehel) [17:44:20] that breaks in two ways: one, the instance role classes use that variable, we need that [17:44:20] also [17:44:29] 'analytics' should == 'eqiad' everywhere [17:44:30] not just in eqiad [17:44:47] so [17:44:47] https://puppet-compiler.wmflabs.org/2459/cp2001.codfw.wmnet/ [17:44:49] fixing that [17:45:22] (03PS2) 10Ottomata: Set $kafka_brokers in role::cache::kafka, analytics == eqiad in all sites [puppet] - 10https://gerrit.wikimedia.org/r/283489 [17:45:56] ottomata: also, it seems we accidentally f*cked up varnishkafka submodule [17:46:07] fixed i think [17:46:10] https://gerrit.wikimedia.org/r/#/c/279280/19/modules/varnishkafka,unified [17:46:12] kk [17:47:23] phew, better [17:47:23] ok [17:47:24] https://puppet-compiler.wmflabs.org/2460/cp2001.codfw.wmnet/ [17:48:28] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [17:48:31] jajaja [17:49:09] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[17:49:37] (03CR) 10Ottomata: [C: 032] Set $kafka_brokers in role::cache::kafka, analytics == eqiad in all sites [puppet] - 10https://gerrit.wikimedia.org/r/283489 (owner: 10Ottomata) [17:49:41] (03PS3) 10Ottomata: Set $kafka_brokers in role::cache::kafka, analytics == eqiad in all sites [puppet] - 10https://gerrit.wikimedia.org/r/283489 [17:49:44] (03CR) 10Ottomata: [V: 032] Set $kafka_brokers in role::cache::kafka, analytics == eqiad in all sites [puppet] - 10https://gerrit.wikimedia.org/r/283489 (owner: 10Ottomata) [17:49:53] (03CR) 10Ottomata: [V: 032] Set $kafka_brokers in role::cache::kafka, analytics == eqiad in all sites [puppet] - 10https://gerrit.wikimedia.org/r/283489 (owner: 10Ottomata) [17:50:08] (03CR) 10DCausse: [C: 031] Convert mwgrep to use regexp by default [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson) [17:51:55] ok, merging mobrovac [17:52:11] running puppet on a bunch of affected hsots [17:52:11] ... [17:52:16] you're merging me or the patch? [17:52:17] :D [17:52:18] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:52:21] haha [17:52:27] U R THE PATCH [17:52:43] (03PS1) 10Alexandros Kosiaris: uwsgi: Add python3 support [puppet] - 10https://gerrit.wikimedia.org/r/283492 [17:52:51] i'm more than a patch, i'm a whole jacket! [17:52:59] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:53:38] hmm, mobrovac a good improvement would be to sort the brokers in the array ahead of time [17:53:53] i just noticed that one of the templates doesn't sort, so it did cause a change, but only reordred brokers [17:54:04] (03PS1) 10Cmjohnson: Adding dns for graphite1003 [dns] - 10https://gerrit.wikimedia.org/r/283493 [17:54:04] aside from that looks perfect! [17:54:09] yay! 
[17:54:38] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: puppet fail [17:54:45] awsesooome [17:56:32] (03CR) 10Cmjohnson: [C: 032] Adding dns for graphite1003 [dns] - 10https://gerrit.wikimedia.org/r/283493 (owner: 10Cmjohnson) [17:57:39] (03CR) 10Ladsgroup: [C: 031] uwsgi: Add python3 support [puppet] - 10https://gerrit.wikimedia.org/r/283492 (owner: 10Alexandros Kosiaris) [17:57:56] oh mobrovac yhour are sorting [17:57:57] hm. [17:58:11] i guess the sort is just different [17:58:12] hm [17:59:24] hm [18:01:05] godog: hi there [18:01:18] feel like finishing with https://phabricator.wikimedia.org/T96132 ? :) [18:01:20] hi matanya [18:02:20] hm ja strange, the next puppet runs are no change mobrovac, soooo dunno! but i guess its fine [18:02:43] (03CR) 10Mobrovac: [C: 04-1] ores: Scap3 deployment configurations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [18:03:10] urandom, godog: our 51 cassandra instances account for over 2 million graphite metrics, and over 62% of *all* metrics [18:03:13] ottomata: you should have merged the patch instead of me and all would be good from the start :D [18:03:20] matanya: hehe could be puppet swat material! [18:03:37] is that happening today ? [18:03:45] and mobrovac ^^ I guess :) [18:03:55] haha [18:04:01] is there anything we can do to cut down on these metrics? [18:04:10] I seriously doubt 2 million metrics are useful [18:04:28] we have 51 cassandra instances :o [18:04:31] paravoid: uninstall cass? :P [18:04:31] ? [18:04:35] * mobrovac trolling [18:04:48] ottomata: I guess so! [18:04:54] this is nuts [18:05:02] paravoid: yeah, the biggest is per-columnfamily usage, each columnfamily has 'meta' and 'data', I think we could cut down on the 'meta' ones [18:05:19] +1 [18:05:23] meta is useless [18:06:03] oh, i missed todays, too bad, will schedule for Tuesday [18:06:49] can we do that? 
[18:07:01] that's around half of that, ~1 mil metrics [18:07:32] paravoid: yeah, something similar to what we did in https://phabricator.wikimedia.org/T113733 [18:09:06] matanya: what's missing though? the patch is merged [18:09:20] ./restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_wikiversity_T_parsoid_html/meta/CasPrepareLatency/50percentile.wsp [18:09:23] ./restbase1001/org/apache/cassandra/metrics/ColumnFamily/local_group_wikiversity_T_parsoid_html/meta/CasPrepareLatency/median.wsp [18:09:26] uhh [18:09:28] isn't 50percentile == median? [18:09:56] godog: so why is it open ? [18:10:08] also do we really need 50p, 75p, 95p, 99p, mean, min, max, count, 1MinuteRate and 5MinuteRate for each metric? [18:10:32] paravoid: check the mtime for those, I think some might be old [18:10:38] also, we seem to have restbase1008, then restbase1008-a & restbase1008-b, are all three in use? [18:11:32] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2207785 (10Volans) Thanks to @Papaul replaced 2 failed disks on db2017 and 1 on db2018. Keeping the last spare disk as spare for now in case another DB disk breaks in the next days, db2023 is o... [18:11:34] matanya: I don't know, I'll make a note to check tomorrow across the fleet [18:12:14] paravoid: yeah some of those are old and can go [18:12:20] tahnks godog [18:12:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [18:12:51] 406k seem to be > 90 days old, another 85k 30-90d [18:13:02] can we clean those up? 
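(editor's note) The stale-metric counts quoted above (files older than 90 days, 30-90 days, and so on) can be reproduced by bucketing whisper files by modification time. The following is a minimal sketch, not the command actually used in the channel: the function names, the bucket labels, and the idea of walking a local whisper directory are all assumptions made for illustration.

```python
import os
import time

# Age thresholds in days, mirroring the ranges mentioned in the discussion.
# Labels are illustrative only.
BUCKETS = [(90, ">90d"), (30, "30-90d"), (7, "7-30d")]


def bucket_for(age_days):
    """Return the label of the first bucket whose threshold is exceeded."""
    for threshold, label in BUCKETS:
        if age_days > threshold:
            return label
    return "<7d"


def count_by_age(root, now=None):
    """Count .wsp files under root, grouped by age of last modification."""
    now = time.time() if now is None else now
    counts = {label: 0 for _, label in BUCKETS}
    counts["<7d"] = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue  # ignore anything that is not a whisper file
            mtime = os.path.getmtime(os.path.join(dirpath, name))
            counts[bucket_for((now - mtime) / 86400.0)] += 1
    return counts
```

Calling count_by_age() on a (hypothetical) whisper root returns a dict of counts per bucket; only the oldest bucket would be a candidate for cleanup, per the "very old ones" remark in the channel.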
[18:13:19] another 115k 7d-30d [18:14:03] (03PS1) 1020after4: scap: add deployment configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/283494 (https://phabricator.wikimedia.org/T114363) [18:14:26] (03CR) 10jenkins-bot: [V: 04-1] scap: add deployment configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/283494 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [18:14:51] the very old ones I'd say so, yeah [18:17:15] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2207820 (10Papaul) Drive replacement on db2017 slot 11 and slot 2 Drive replacement on db2018 slot 3 [18:19:57] <_joe_> I would go on record and say that if we need to collect more than 100 metrics per instance, we are doing something wrong [18:20:43] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:22:02] (03PS2) 10Chad: Try to tune back ldap logging a tad. Rather spammy. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283259 [18:22:08] jouncebot: next [18:22:08] In 0 hour(s) and 37 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T1900) [18:22:21] (03CR) 10Chad: [C: 032] Try to tune back ldap logging a tad. Rather spammy. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283259 (owner: 10Chad) [18:22:25] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2207838 (10Papaul) p:05Triage>03Normal [18:22:47] (03Merged) 10jenkins-bot: Try to tune back ldap logging a tad. Rather spammy. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/283259 (owner: 10Chad) [18:23:52] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: ldap logging tune (duration: 00m 34s) [18:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:43] (03PS4) 10Chad: Gerrit: replicate git repositories to new home [puppet] - 10https://gerrit.wikimedia.org/r/282761 [18:26:09] (03PS2) 10Chad: Remove expired old staticy things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282983 [18:26:14] (03CR) 10Chad: [C: 032] Remove expired old staticy things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282983 (owner: 10Chad) [18:26:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:26:45] (03Merged) 10jenkins-bot: Remove expired old staticy things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282983 (owner: 10Chad) [18:27:32] (03PS2) 1020after4: scap: add configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/283494 (https://phabricator.wikimedia.org/T114363) [18:28:43] (03PS1) 10Faidon Liambotis: graphite: space-out the graphite-index generation [puppet] - 10https://gerrit.wikimedia.org/r/283496 [18:28:58] (03CR) 1020after4: [C: 031] "Fixed the problem that was causing this to fail on tin in previous merge attempts." 
[puppet] - 10https://gerrit.wikimedia.org/r/283494 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [18:29:28] (03CR) 10Faidon Liambotis: [C: 032 V: 032] graphite: space-out the graphite-index generation [puppet] - 10https://gerrit.wikimedia.org/r/283496 (owner: 10Faidon Liambotis) [18:30:34] _joe_: I think varnishstat metrics in ganglia export more than that heh [18:31:20] <_joe_> bblack: uhm let me check :P [18:31:30] !log demon@tin Started scap: sync some symlink removals [18:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:42] 477 metrics from raw varnishstat, total between a text-frontend + text-backend daemon on a single host [18:31:51] I don't think all of them go to ganglia, but a lot do [18:32:11] <_joe_> bblack: it's around 250 combined [18:32:21] <_joe_> it's still two instances [18:32:47] <_joe_> it's still 2 orders of magnitude less than what cassandra is exporting, more or less [18:32:52] lol [18:32:53] <_joe_> probably more [18:33:04] <_joe_> and the cassandra metrics are per instance [18:33:10] <_joe_> we have up to 3 per server [18:33:54] oh right, the whole workaround java by running multiple clusters across one set of hosts thing, I only vaguely remember what went into that [18:34:27] (03CR) 10Chad: "@qchris This look ok to go?" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [18:34:33] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2207848 (10Papaul) I reset the IDRAC and upgrade the firmware from 1.7 to 2.8, but having the same issue can not redirect to com2. All the settings in the BIOS for the... 
[18:36:01] bblack _joe_ container settings is almost finished, still planning to resume with the switch once that's finished [18:36:24] !log demon@tin Finished scap: sync some symlink removals (duration: 04m 54s) [18:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:47] godog: ok [18:37:20] (03PS1) 10Dereckson: Add images.unsplash.com to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283497 (https://phabricator.wikimedia.org/T132701) [18:37:55] <_joe_> godog: I'm around [18:38:24] (03PS1) 10Chad: Group2 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283498 [18:39:25] _joe_: where does this 100 metrics/instance figure come from? [18:40:03] <_joe_> urandom: randomly out of my ass, of course [18:40:29] i must confess to feeling some culture shock on this issue [18:40:54] <_joe_> urandom: 100 is of course pretty low, but 25k metrics/instance seems a bit too many [18:41:42] i'm accustomed to putting a premium on data like this; that the cost of storage is low enough, and the potential for value so high, that'd you can cast a wide net [18:42:00] it's not always obvious what is going to be of value ahead of time [18:42:37] (03CR) 10Ladsgroup: ores: Scap3 deployment configurations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [18:42:40] 06Operations, 13Patch-For-Review, 15User-mobrovac: Replace role::kafka::*::config classes with puppet functions. - https://phabricator.wikimedia.org/T130371#2207867 (10Ottomata) 05Open>03Resolved WE goOOood! Thanks Marko! [18:43:54] _joe_: so it strikes me as odd when Cassandra is considered aberrant for exporting too many metrics, when it "fewer" is better [18:44:08] s/when it/when/ [18:45:04] ok finished with containers, switching thumbs to codfw [18:45:17] ostriches: FYI ^ though it shouldn't impact the train [18:45:21] (03PS2) 10Jdrewniak: Updating portals to master. 
removing top-links A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283481 (https://phabricator.wikimedia.org/T124116) [18:46:30] _joe_: there are lots of these metrics that are of limited usefulness beyond the near term [18:46:56] _joe_: some of them are interesting to look at months or even years later, but not many [18:47:23] <_joe_> urandom: If so, graphite is probably not the best place to store those [18:47:44] but i guess that applying per-metric retention is probably too costly in other terms [18:47:48] maybe not [18:50:13] 06Operations, 06Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#2207876 (10Andrew) 05declined>03Open This ticket has a terrible, unclear title, and even after reading the ticket I'm not 100% sure what it's about. I'm pretty sure that this bug is about... [18:50:23] (03PS1) 10Filippo Giunchedi: Revert "varnish: move thumbs back to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/283501 [18:50:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "varnish: move thumbs back to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/283501 (owner: 10Filippo Giunchedi) [18:51:18] !log forcing puppet run on cache_upload in eqiad and codfw [18:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:51:51] (03Abandoned) 10Dzahn: swift: include admin in role classes [puppet] - 10https://gerrit.wikimedia.org/r/283373 (owner: 10Dzahn) [18:54:14] <_joe_> urandom: we should probably pick the metrics we want to look at for performance regularly and save them on graphite, and archive the rest on some more suitable datastore maybe [18:54:27] <_joe_> urandom: can't we write cassandra metrics on cassandra? 
:P [18:54:40] * _joe_ hides [18:55:24] (03CR) 10Dzahn: [C: 031] "thank you very much, so we have ./common/swift, role/common/swift/eqiad_prod/ but we need role/eqiad/swift etc ?:) can be confusing :)" [puppet] - 10https://gerrit.wikimedia.org/r/283471 (https://phabricator.wikimedia.org/T130910) (owner: 10Andrew Bogott) [18:55:55] (03CR) 10Dzahn: "should i amend to remove what i added then?" [puppet] - 10https://gerrit.wikimedia.org/r/283471 (https://phabricator.wikimedia.org/T130910) (owner: 10Andrew Bogott) [18:56:09] finished with forced puppet, moving codfw to direct [18:56:33] (03PS4) 10Filippo Giunchedi: varnish: switch upload codfw from 'eqiad' to 'direct' [puppet] - 10https://gerrit.wikimedia.org/r/282891 (https://phabricator.wikimedia.org/T129089) [18:56:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: switch upload codfw from 'eqiad' to 'direct' [puppet] - 10https://gerrit.wikimedia.org/r/282891 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [18:58:19] !log force puppet run on cache_upload in codfw [18:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:05] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T1900). [19:00:12] (03CR) 10Chad: [C: 032] Group2 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283498 (owner: 10Chad) [19:00:47] (03Merged) 10jenkins-bot: Group2 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283498 (owner: 10Chad) [19:01:15] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.21 [19:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:48] ostriches: can you let me know once you are done with the train? [19:03:10] I'm still keeping an eye on logs but the train's technically complete. 
[19:03:48] 06Operations: Replace role::kafka::*::config classes with puppet functions. - https://phabricator.wikimedia.org/T130371#2207935 (10mobrovac) [19:05:15] ok, I'll depool upload/eqiad shortly [19:06:16] (03PS1) 10Dzahn: Revert "swift/admin: set admin group per dc and role" [puppet] - 10https://gerrit.wikimedia.org/r/283505 [19:06:17] bblack: does https://gerrit.wikimedia.org/r/#/c/283416/1 look ok? [19:06:39] (03CR) 10jenkins-bot: [V: 04-1] Revert "swift/admin: set admin group per dc and role" [puppet] - 10https://gerrit.wikimedia.org/r/283505 (owner: 10Dzahn) [19:06:41] (03CR) 10BBlack: [C: 031] depool upload/eqiad for codfw switchover [dns] - 10https://gerrit.wikimedia.org/r/283416 (owner: 10Filippo Giunchedi) [19:06:47] godog: yup [19:07:23] <_joe_> I'm unsure I get why we should do this [19:07:40] <_joe_> we still want people nearer to eqiad to go through eqiad => codfw, right? [19:08:02] we want to partially test how it is going to look like on tues [19:08:09] <_joe_> or we don't want to transform eqiad in a tier-two caching dc? [19:08:25] <_joe_> godog: what are you testing exactly? 
[19:08:48] _joe_: he's doing the same thing we'll do for the other clusters next week [19:08:59] simulate as nearly as we can "running without eqiad", including at the cache layer [19:09:09] <_joe_> oh ok [19:09:15] <_joe_> cool [19:09:28] so no users directly to eqiad, and no other caches backending to eqiad either [19:09:56] (only for cache_upload today, for all the rest next week) [19:10:00] (03PS2) 10Filippo Giunchedi: depool upload/eqiad for codfw switchover [dns] - 10https://gerrit.wikimedia.org/r/283416 [19:10:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] depool upload/eqiad for codfw switchover [dns] - 10https://gerrit.wikimedia.org/r/283416 (owner: 10Filippo Giunchedi) [19:11:02] !log depool upload/eqiad [19:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:34] (03CR) 10Mobrovac: ores: Scap3 deployment configurations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [19:13:18] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2207998 (10Krenair) To be clear, this is a system that I noticed had issues when I got labtest-roots so I opened a ticket to ensure it was not forgotten. It's not a pr... [19:14:46] (03PS2) 10Dzahn: Revert "swift/admin: set admin group per dc and role" [puppet] - 10https://gerrit.wikimedia.org/r/283505 [19:19:32] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[19:20:44] I'll wait for eqiad to be fully drained before flipping esams to codfw [19:22:26] it's not necessary really [19:22:39] (03PS2) 10BBlack: codfw switch: esams text caches -> codfw [puppet] - 10https://gerrit.wikimedia.org/r/283431 [19:22:41] (03PS2) 10BBlack: codfw switch: codfw text caches -> direct [puppet] - 10https://gerrit.wikimedia.org/r/283430 [19:22:43] (03PS2) 10BBlack: codfw switch: eqiad text caches -> codfw [puppet] - 10https://gerrit.wikimedia.org/r/283432 [19:23:37] ah, heh possibly OCD to see the change in graphs when things move over [19:24:42] ok moving esams [19:25:00] (03PS2) 10Filippo Giunchedi: varnish: switch esams from 'eqiad' to 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/283418 (https://phabricator.wikimedia.org/T129089) [19:25:14] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: switch esams from 'eqiad' to 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/283418 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [19:25:45] (03PS2) 10BBlack: codfw switch: geodns depool text services from eqiad [dns] - 10https://gerrit.wikimedia.org/r/283433 [19:26:09] !log force puppet run for cache_upload in esams [19:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:38] looks like it is working, upload caches in eqiad are at zero traffic now [19:31:58] <_joe_> cool [19:32:26] _joe_: yes, yes we can: http://opennms.github.io/newts/ [19:32:30] :P [19:32:39] last piece, eqiad to codfw [19:32:58] (03PS3) 10Filippo Giunchedi: varnish: switch upload eqiad from 'direct' to 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/282892 (https://phabricator.wikimedia.org/T129089) [19:33:05] <_joe_> godog: the load on the swift cluster in codfw is going up steadily [19:33:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: switch upload eqiad from 'direct' to 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/282892 (https://phabricator.wikimedia.org/T129089) (owner: 
10Filippo Giunchedi) [19:33:39] <_joe_> what exactly did we change in the last hour? [19:33:51] I would hope it is :) [19:34:37] ignoring the traffic-layer-internal things, we switched upload.wm.o cache misses from hitting ms-fe.eqiad to hitting ms-fe.codfw [19:35:17] !log force puppet run on cache_upload in eqiad [19:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:35:50] the traffic-layer stuff: users are no longer hitting eqiad directly (trailing minority with broken DNS aside), but still hitting all 3 other sites directly. [19:36:11] and before inter-cache was ulsfo->codfw, codfw->eqiad, esams->eqiad, eqiad->Swift [19:36:24] now it is ulsfo->codfw, codfw->Swift, esams->codfw, eqiad->codfw [19:37:33] \o/ look at that, eqiad upload at zero traffic and it works [19:38:40] the ms-fe switch was quite some time back now, though, the cache stuff more-recent [19:39:11] looks like 16:20 for ms-fe switch in graphs? [19:39:32] which is back when the thumbs 401 thing came up [19:39:52] yeah, then I reverted only thumbs, originals stayed in codfw [19:40:00] ah ok, that makes sense now [19:40:46] 06Operations, 10Traffic, 10pywikibot-compat, 10pywikibot-core, 07HTTPS: pywikibot support for https-only - https://phabricator.wikimedia.org/T102315#2208073 (10Andrew) As I understand it, this ticket is a request for updates to the pywikibot code. I'm going to remove the Operations tag; please re-add wi... 
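(editor's note) The before/after inter-cache routing described above can be written down as two small maps and checked mechanically. This is a sketch for illustration only: the dict layout and helper names are assumptions, and the real configuration lives in puppet, not in Python.

```python
# Backend for cache misses at each site, as stated in the channel.
BEFORE = {"ulsfo": "codfw", "codfw": "eqiad", "esams": "eqiad", "eqiad": "Swift"}
AFTER = {"ulsfo": "codfw", "codfw": "Swift", "esams": "codfw", "eqiad": "codfw"}


def miss_path(site, routes, origin="Swift"):
    """Follow a cache miss from site through the tiers until it reaches origin."""
    path = [site]
    while path[-1] != origin:
        nxt = routes[path[-1]]
        if nxt in path:
            raise ValueError("routing loop at %s" % nxt)
        path.append(nxt)
    return path


def sites_backending_to(target, routes):
    """List the other sites whose misses pass through target."""
    return sorted(
        site for site in routes
        if site != target and target in miss_path(site, routes)
    )
```

With these maps, sites_backending_to("eqiad", BEFORE) lists all three other sites, while the same call on AFTER returns an empty list, which matches the goal of having no other caches backending to eqiad.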
[19:41:39] bblack: looks like the esams traffic to/from eqiad is ~250MB/s in either direction, looking at https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Upload%20caches%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1460662201&g=network_report&z=large [19:43:55] that seems a little higher than I expect, but not completely unreasonable [19:44:51] we'll get to fix some things about upload caching a bit better once varnish4 hits, though [19:45:08] about large objects, range-reqs, and when to pass and/or stream [19:45:30] 06Operations, 10RESTBase-Cassandra: Grafana bugginess; Graph scales sometimes off by an order of magnitude - https://phabricator.wikimedia.org/T121789#2208099 (10Andrew) p:05Triage>03Normal [19:45:39] nice [19:46:54] 06Operations, 10Traffic, 10Wikimedia-IRC-RC-Server, 07HTTPS, and 2 others: Remove the "HTTPS to HTTP" url filter in the IRC feed - https://phabricator.wikimedia.org/T122933#2208105 (10Andrew) p:05Triage>03Normal [19:47:18] ok I'm off, pager is with me in case, thanks _joe_ bblack ! [19:49:36] cya, good work :) [19:50:27] 06Operations, 10Wikimedia-General-or-Unknown: Connection to Wikimedia projects slow/timing out for some users - https://phabricator.wikimedia.org/T124417#2208124 (10Andrew) 05Open>03Resolved a:03Andrew Presumed fixed. @Samtar, please re-open if it's still happening. [19:52:43] for context since I was looking around after godog's 250MB/s comment: [19:52:46] http://snag.gy/qF6Th.jpg [19:52:49] 06Operations, 10Incident-20160126-WikimediaDomainRedirection, 10Monitoring: add icinga and watchmouse https checks for content on commons. 
or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#2208138 (10Andrew) p:05Triage>03High [19:53:19] that's reqs/sec into the front edge of the upload caches in esams, vs reqs/sec out the back of the esams upload caches towards eqiad/codfw and eventually Swift (what has to go over the network links back to the US) [19:54:26] <_joe_> bblack: wow [19:55:48] 06Operations, 06Discovery, 10Kartotherian, 10Maps, 03Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2208142 (10Andrew) p:05Triage>03High @Tfinc, this is just a drive-by, but I think that the correct next step is to open a subtask of this ticket w... [19:57:24] 06Operations, 10Traffic, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2208148 (10Andrew) p:05Triage>03Normal [19:58:02] 06Operations, 10hardware-requests: rack and set up graphite1003 - https://phabricator.wikimedia.org/T132717#2208150 (10Cmjohnson) [19:58:38] (03PS3) 10Dzahn: Revert "swift/admin: set admin group per dc and role" [puppet] - 10https://gerrit.wikimedia.org/r/283505 [19:58:38] 06Operations, 10DBA: Investigate/decom db2001-db2007 - https://phabricator.wikimedia.org/T125827#2208172 (10Andrew) p:05Triage>03Normal [19:59:00] (03CR) 10Dzahn: "merge this together for cleanup: https://gerrit.wikimedia.org/r/#/c/283505/" [puppet] - 10https://gerrit.wikimedia.org/r/283471 (https://phabricator.wikimedia.org/T130910) (owner: 10Andrew Bogott) [19:59:49] (03CR) 10Andrew Bogott: [C: 04-2] "I have removed (or am in the process of removing) those files entirely, so this patch isn't needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/283505 (owner: 10Dzahn) [20:01:03] (03CR) 10Dzahn: [C: 031] "lgtm, ack from moritz would be great" [puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad) [20:05:39] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad) [20:06:07] 06Operations, 06Discovery, 10Kartotherian, 10Maps, 03Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2208192 (10Tfinc) @Gehel Given that this is approved by Jaime and in plan pending FDC approval do you want close it out and re-open when we finalize o... [20:06:37] (03CR) 10Ladsgroup: ores: Scap3 deployment configurations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [20:07:55] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2208204 (10Dzahn) Thank you, i have also uploaded a patch that removes my former attempts to clean it up. Please merge both together. [20:11:41] (03CR) 10Dzahn: "ok, yes. abandoning. 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/283505 (owner: 10Dzahn) [20:12:02] (03Abandoned) 10Dzahn: Revert "swift/admin: set admin group per dc and role" [puppet] - 10https://gerrit.wikimedia.org/r/283505 (owner: 10Dzahn) [20:12:46] (03CR) 10Dzahn: [C: 031] Prune another dead hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/283476 (owner: 10Andrew Bogott) [20:13:08] (03CR) 10Dzahn: "yep, i added this and nothing happened" [puppet] - 10https://gerrit.wikimedia.org/r/283476 (owner: 10Andrew Bogott) [20:14:09] (03PS5) 10Dzahn: Gerrit: replicate git repositories to new home [puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad) [20:14:34] 06Operations, 06Discovery, 10Kartotherian, 10Maps, 03Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2208218 (10Gehel) The related hardware requests are T131180 and T131880. So yes, I think we can close this for the moment. [20:14:53] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2208223 (10Gehel) [20:14:55] 06Operations, 06Discovery, 10Kartotherian, 10Maps, 03Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2208222 (10Gehel) 05Open>03Resolved [20:14:59] (03PS1) 10Chad: Move etherpad diamond collector script to module [puppet] - 10https://gerrit.wikimedia.org/r/283512 [20:15:01] (03CR) 10Dzahn: [C: 032] Gerrit: replicate git repositories to new home [puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad) [20:15:43] mutante: Oh thx ^ [20:15:44] :) [20:16:16] ostriches: ready to run it on lead ?:) [20:16:28] Yeah lemme do it [20:16:47] ok, merging on master [20:17:12] done [20:18:45] Ok, and now to run on ytterbium [20:20:07] (03CR) 10Chad: "Puppet compiler says yay! 
https://puppet-compiler.wmflabs.org/2461/" [puppet] - 10https://gerrit.wikimedia.org/r/283512 (owner: 10Chad) [20:21:41] Reloading replication plugin... [20:22:23] oh, you are killing stuff from ./files. LIKE! [20:22:45] "Algorithm negotiation fail" [20:22:47] (03PS2) 10Dzahn: Move etherpad diamond collector script to module [puppet] - 10https://gerrit.wikimedia.org/r/283512 (owner: 10Chad) [20:22:48] Hmmm [20:22:50] uh? [20:22:55] ytterbium? [20:25:18] ostriches: lead doesnt have base::firewall yet [20:25:47] Yeah ytterbium can't connect yet [20:26:02] looking on lead [20:26:20] Hmm, it worked just using `ssh` myself. [20:26:28] so it's 2 things [20:26:38] we need to add base firewall but it should be open now [20:28:10] I wonder why gerrit can't connect. [20:28:22] there are no rules , iptables -L [20:29:12] but let's add them [20:29:39] Gerrit shouldn't have needed a restart for this. [20:29:44] *puzzles* [20:29:45] ah [20:30:04] remembers each config change restarts it, doesnt it [20:30:16] replication config changes don't. [20:30:24] *nod* [20:30:25] Because it just needs a plugin reload, not a restart. [20:31:23] !log gerrit restarting [20:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:47] heh, in that precise moment i tried to use git review [20:32:21] error: Could not make fetch happen [20:32:32] *snicker* [20:32:51] https://cdn.meme.am/instances/23367808.jpg [20:33:04] yes :) [20:33:12] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:34:52] Well *that* didn't help [20:35:13] what happened? [20:35:35] Problem didn't go away, and restarting means it wants to replicate a lot of things now :P [20:36:23] Oh. [20:36:25] I remember [20:36:36] ok, let's check the lead part again.. yes? 
[20:36:50] https://gerrit.wikimedia.org/r/#/c/283537/ btw [20:37:20] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:21] that will add all the rules at once, defaults and the holes from the role [20:38:53] I have to add lead's rsa key to gerrit's known_hosts. [20:40:24] aha [20:41:13] i'll just go ahead with that lead change, k? [20:42:00] Yeah [20:42:12] That known_hosts thing has tripped me up every single time. [20:44:13] Still no dice :\ [20:44:14] wtf. [20:44:31] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:44:49] does it just timeout ? [20:45:22] [2016-04-14 20:43:35,870] ERROR com.googlesource.gerrit.plugins.replication.ReplicationQueue : Cannot replicate to gerritslave@lead.wikimedia.org:/srv/gerrit/git/operations/puppet.git [20:45:24] org.eclipse.jgit.errors.TransportException: gerritslave@lead.wikimedia.org:/srv/gerrit/git/operations/puppet.git: Algorithm negotiation fail [20:46:12] ok, so lead is jessie [20:46:20] when this worked before, that was not jessie , right [20:46:33] and negotiation fail.. hmm [20:46:41] jgit version? [20:47:59] andrewbogott, mutante, did the host key for bast1001.wikimedia.org change? [20:48:15] I get the warning of a change, and it says: [20:48:16] matt_flaschen: yes. There was an email to the ops list with details. [20:48:21] mutante: I dunno, whichever one we've had bundled in there for ages. [20:48:28] matt_flaschen: I can forward to you if you like [20:48:42] andrewbogott, what's the subject? [20:48:53] it changed twice [20:49:10] "scheduled downtime for bast1001 tomorrow 1800 UTC" [20:49:33] matt_flaschen: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast1001.wikimedia.org [20:49:43] mutante, andrewbogott, thanks. 
[20:52:11] ostriches: http://stackoverflow.com/questions/29797017/algorithm-negotiation-fail-error-in-eclipse-when-trying-to-connect-by-ssh-to-a [20:52:21] there was also a wikitech-l email [20:52:48] "Error: Server does not support diffie-hellman-group1-sha1 for keyexchange" [20:53:07] ostriches: that could be it , the kex config, sshd [20:53:33] modules/ssh/templates/sshd_config.erb:<%- if @disable_nist_kex -%> [20:53:51] it is different per distro [20:54:14] KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256 [20:54:27] that would fit Algo Negotiate fail [20:55:09] Aha! [20:55:22] yea, lead has that line [20:55:33] We hit something similar before... [20:55:48] we can test it by flipping disable_nist_kex [20:56:05] On lead? [20:56:05] yes, indeed [20:56:09] yea [20:56:20] well, in puppet [20:56:36] Ok that'll work for now. This is temporary until we can get off of the old box [20:57:03] ok, hold on [20:57:17] hieradata/hosts/gallium.yaml:ssh::server::disable_nist_kex: false [20:57:17] hieradata/hosts/antimony.yaml:ssh::server::disable_nist_kex: false [20:57:19] hah [20:57:37] that is what you remembered i think [20:58:17] there is also ssh::server::explicit_macs: false [20:58:23] Krenair: can https://phabricator.wikimedia.org/T125748 be closed? And/or does it really need the Operations tag? [20:59:40] andrewbogott, I think I added that because of https://gerrit.wikimedia.org/r/#/c/267816/ [20:59:54] I'll remove it, doesn't make a lot of sense [20:59:59] 06Operations, 07Diamond, 07Upstream: Diamond load averages do not contain scaled versions - https://phabricator.wikimedia.org/T125411#2208372 (10Andrew) 05Open>03stalled p:05Triage>03Normal [21:00:20] Krenair: ok, thanks [21:00:45] 06Operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 3 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2208375 (10Andrew) p:05Triage>03High [21:01:02] grrrit-wm restart maybe? 
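Note on the "Algorithm negotiation fail" error discussed above: in SSH key exchange, each side sends a list of algorithms it supports, and the connection fails if the lists share no entry. The sketch below illustrates that with the KexAlgorithms line quoted from lead's sshd_config. The client list is an assumption (a SHA-1-only list of the kind older Java SSH clients offered); the log never shows what Gerrit actually sent. The "relaxed" server list is likewise hypothetical, and `negotiate_kex` is an illustrative helper, not part of any SSH library.

```python
# Simplified sketch of SSH key-exchange (kex) algorithm selection:
# the first algorithm on the client's list that the server also
# accepts wins; no overlap means the negotiation fails.

# Taken from the KexAlgorithms line quoted in the log for lead.
SERVER_KEX = [
    "curve25519-sha256@libssh.org",
    "diffie-hellman-group-exchange-sha256",
]

# ASSUMPTION: what an older Java SSH client might offer.
# Not taken from the log.
CLIENT_KEX = [
    "diffie-hellman-group-exchange-sha1",
    "diffie-hellman-group14-sha1",
    "diffie-hellman-group1-sha1",
]


def negotiate_kex(client_algos, server_algos):
    """Return the first client-preferred algorithm the server accepts, or None."""
    server_set = set(server_algos)
    for algo in client_algos:
        if algo in server_set:
            return algo
    return None


if __name__ == "__main__":
    chosen = negotiate_kex(CLIENT_KEX, SERVER_KEX)
    print(chosen or "Algorithm negotiation fail")

    # HYPOTHETICAL relaxed server list, standing in for a server
    # without the restrictive KexAlgorithms line.
    relaxed = SERVER_KEX + ["diffie-hellman-group14-sha1"]
    print(negotiate_kex(CLIENT_KEX, relaxed))
```

With the restricted list the two sides have nothing in common, which matches the symptom in the Gerrit error log. Once the server accepts at least one algorithm the client offers, a match is found.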
[21:01:27] 06Operations, 07Icinga: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2208378 (10Andrew) p:05Triage>03Normal a:03Dzahn [21:01:31] ostriches: i think the bot needs kicking after gerrit restart [21:01:47] i'm testing the change now [21:01:57] 06Operations: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#2208380 (10Andrew) p:05Triage>03Normal a:03Dzahn [21:02:52] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:03] 06Operations, 06Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#2208382 (10Andrew) p:05Triage>03Normal a:03Dzahn [21:05:43] 06Operations, 10Traffic, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2208402 (10Andrew) p:05Triage>03Normal [21:06:05] Hmm, still no good on ytterbium side [21:06:26] 06Operations, 10Architecture, 10DBA: Architecture decision to solve the need larger serves (for better capacity and consolidation) vs. 
more, smaller servers (for high availability) - https://phabricator.wikimedia.org/T124681#2208403 (10Andrew) p:05Triage>03Normal [21:06:53] ostriches: wait, wasnt applied [21:07:07] Oh, I hadn't done anything was just still tailing the log [21:08:20] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2208420 (10Dzahn) a:03Dzahn [21:08:38] 06Operations, 06Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#2208423 (10Dzahn) a:05Dzahn>03None [21:09:44] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2208425 (10Andrew) p:05Triage>03Normal [21:10:41] hrmm, the expected change did not happen, what [21:12:02] ostriches: try now? [21:15:33] "reject host key" [21:15:36] Ok, progress. [21:15:40] I added it to the file, hmm [21:15:41] yay [21:16:09] so it's really the KexAlgorithms [21:16:26] just that puppet did not remove it as expected but i did..hmm [21:16:38] tests more if puppet adds it back or not [21:16:41] reject host key is usually the known_hosts issue. [21:16:55] I copied lead's ssh_host_rsa_key.pub over, hmm [21:17:40] do you have one that isnt RSA [21:17:42] 62 RhostsRSAAuthentication no [21:17:50] ecdsa? [21:19:42] There's a couple on lead. [21:19:48] ok, so puppet removed that line now (as we wanted it to) [21:20:09] lets try ecdsa [21:20:20] yes, there is rsa, dsa, ecdsa and that ed5519 or so [21:20:24] yes [21:22:42] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[21:22:50] Still rejecting the host key :\ [21:22:51] boo [21:23:34] give it one more try now [21:23:39] i tried commenting another line [21:23:53] even though it seems unlikely with that error message [21:24:02] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:24:04] but it's the other hardening thing that we disabled on gallium [21:24:12] explicit MACs [21:24:57] ugh, eh, only now [21:24:59] Back to algorithm negotiation fails. [21:25:37] still ? [21:25:42] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [21:25:51] Yeah [21:26:30] eh, odd, it should be like it was just before that [21:26:44] disabled puppet [21:27:30] again ? [21:28:29] how do you test it on ytterbium, could i run that too? [21:28:49] Back to rejected host key [21:28:59] ok [21:29:02] I've just got 2 replication jobs that are stuck in their retry loop [21:29:11] I'm just tailing gerrit's error_log to see when/how they fail [21:29:19] ah, easy enough [21:30:15] ok, rejected host key is on ytterbium's side, lead should be ok. [21:30:27] We just gotta get the right key into known_hosts and restart gerrit again [21:30:35] ok! :) [21:31:15] bd808: would you be willing to baby sit https://gerrit.wikimedia.org/r/#/c/283555/ through SWAT? [21:31:19] except, it seems that the change we need doesnt work with puppet yet [21:31:39] the hiera settings -> config thing [21:37:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:38:41] ostriches: unfortunately the etherpad/diamond thing has an issue in prod, even though you tested in compiler.. [21:38:54] Grr :( [21:39:04] yea, i'm really surprised why [21:39:14] it compiled fine , i saw all that [21:39:43] puppet:///modules/etherpad/files/etherpad.py [21:40:17] that's not in that place [21:40:43] but where you just moved it.. wtf [21:40:51] did gerrit die? 
[21:41:06] looks like it [21:41:09] the bot did because gerrit was restarted, gerrit itself shouldn't [21:41:09] now it's back [21:41:30] jzerebecki: I can't tonight. I have an "event" to go to [21:41:54] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2208536 (10Volans) Rebuild halfway trough, I've disabled notification for db2017 and db2018 to avoid to be awaken in the middle of the night by the recovery. I'll re-activate them tomorrow afte... [21:42:01] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:18] ostriches: ah, i know why. fixing [21:43:48] ostriches: I've got an "urgent" request from multchill, thedj and others to deploy a patch to ContentTranslation -- https://gerrit.wikimedia.org/r/#/c/283570/ [21:43:51] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:43:56] !log csteipp@tin Synchronized wmf-config/InitialiseSettings-labs.php: Syncing labs config change (duration: 00m 34s) [21:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:44:18] ostriches: If I can get gerrit and friends to cooperate can I push it out? [21:44:20] bd808: Fine by me. [21:44:28] I'll stop restarting it again and again [21:44:29] For a bit [21:44:44] !log csteipp@tin Synchronized wmf-config/CommonSettings-labs.php: Syncing labs config change (duration: 00m 27s) [21:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:44:58] csteipp: ping me when you are done please :) [21:45:08] bd808: All done [21:46:17] csteipp: thanks. [21:46:32] bd808: thedj might have gone to bed [21:46:38] * bd808 pushes buttons and crosses fingers [21:46:58] multichill: k. I was just pinging randomly based on the bug [21:48:18] ostriches: etherpad/diamond thing is fixed. 
that was just that thing where puppet doesnt want "files" to be in the path when it's inside a module. [21:48:27] oh whoops [21:48:28] thx [21:48:28] it's a bit random [21:49:31] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:50:02] * bd808 stares intently at zuul [21:51:00] * ostriches finds something stabby shaped for gerrit [21:51:35] bd808: Was saying it to thedj early. Feels like Wiki Loves Monuments. Writing code at the last moment, custom javascript to compensate for lack of ux and last minute deploys [21:51:42] 06Operations, 10Traffic, 07HTTPS: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2208542 (10Dzahn) Is this just about adding https or also about enforcing it? [21:52:20] multichill: life on the wikis ;) [21:52:21] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:53:28] I'm going to guess we have ~3 more minutes until jerkins lets it merge [21:54:55] 06Operations, 10Traffic, 07HTTPS: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2208547 (10Dzahn) and would this be about a certificate on carbon or can it be behind varnish even though it's an APT mirror? 
[21:57:22] multichill: syncing now [21:57:43] !log bd808@tin Synchronized php-1.27.0-wmf.21/extensions/ContentTranslation: Enable europeana2802016 campaign (T125626) (duration: 00m 34s) [21:57:43] T125626: Configure europeana2802016 ContentTranslation campaign - https://phabricator.wikimedia.org/T125626 [21:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:58:24] multichill: no more angry error message for me [21:58:29] can you double check [21:59:06] Will do [21:59:43] gilles: you have access on swift machines now (or as soon as puppet ran on all), checked on ms-be1001 [21:59:48] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2208563 (10Andrew) 05Open>03Resolved [22:00:01] multichill: Yup, works :-D [22:00:09] w00t [22:00:11] Tried with https://la.wikipedia.org/w/index.php?title=Special:ContentTranslation&campaign=europeana2802016&page=Kr%C3%A1sn%C3%A1+Madona+z+Kru%C5%BClowe&from=cs&to=la [22:00:42] "Content Translation has been activated in this wiki." [22:01:01] .....test and give feedback about new features before we launch them as default behavior. Try out something new now!" [22:01:24] Much appreciated! [22:02:06] multichill: yw [22:02:12] I'll add some notes on the bug [22:03:29] Great. Time for bed [22:05:14] mutante: Hi this got merged https://gerrit.wikimedia.org/r/#/c/282478/ but it doesen't work on phabricator.wikimedia.org but it works on my local machine. The only difference is i doint store the files on a separate link and i doint get this type of link https://phabricator.wikimedia.org/F3874805 [22:05:25] To see it working http://www.test-random-wikisaur.tk/diffusion/1/browse/master/index.php [22:05:30] please go to ^^ [22:06:10] restarting again.... [22:06:57] * bd808 is done on tin [22:07:43] paladox: i can confirm it does what you said on your link.. BUT.. 
i think it's apples and oranges in a way, because one is a Diffusion link and the other an attachment in Maniphest [22:08:14] paladox: please report in -devtools too [22:08:19] ok [22:08:21] thanks [22:08:24] np [22:14:20] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure, and 2 others: Clean up labs graphite datapoints - https://phabricator.wikimedia.org/T111540#2208576 (10Krenair) [22:18:46] mutante: So neither the rsa or the ecdsa key worked :\ [22:22:27] ostriches: any way to get more verbosity? [22:23:23] jouncebot: next [22:23:23] In 0 hour(s) and 36 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T2300) [22:23:41] Hmmm [22:24:02] twentyafterfour: merging your change and checking tin [22:24:27] mutante: awesome thank you [22:25:16] mutante: On, on the misc files thing. https://gerrit.wikimedia.org/r/#/c/280481/ too, trivial [22:25:37] deploy-phabricator has been created on tin [22:26:05] Notice: /Stage[main]/Admin/Admin::Groupmembers[deploy-phabricator]/Exec[deploy-phabricator_ensure_members]/returns: executed successfully [22:26:10] twentyafterfour: done, no problems [22:28:35] mutante: sweet. so in yaml, the spaces matter. "members: [twentyafterfour, other, people]" works but "members: [twentyafterfour,other,people]" [22:28:40] ostriches: hehe, blaming is always the default, right :) "$feature{'blame'}{'default'} = [1]" [22:29:12] thanks thcipriani for figuring that out I was blind to it (didn't see the missing spaces no matter how much I looked at that change) [22:30:02] twentyafterfour: :D glad that worked. It's weird. The version of ruby I had locally, yaml.load_file handled no spaces just fine. [22:30:24] twentyafterfour: ah, yea, it's picky about that , "colon+space" is the separator [22:30:44] eh, comma in this case [22:31:58] ostriches: gitweb_config.pl is gone [22:32:05] thx [22:32:18] Oh, and I can't adjust logging on the fly. 
Need to upgrade for that :\ [22:32:27] yea, about the lead issue, can we get back to it later [22:32:36] i'm afraid i have to go afk to pick something up [22:32:38] I'm gonna keep poking it on my side. [22:32:42] I'm pretty sure I can figure it out [22:32:44] ok [22:47:23] 06Operations, 10Traffic, 07HTTPS: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2198925 (10BBlack) It's about enabling HTTPS with valid certificates for all 3, but not about enforcing it with a redirect (or anything else more advanced). We actually **can't**... [23:00:04] RoanKattouw ostriches Krenair MaxSem awight Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160414T2300). [23:00:04] jan_drewniak Dereckson Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:27] o/ [23:01:21] I can do it quickly. [23:01:27] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2208685 (10BBlack) Considering that watchmouse's own status pages, e.g. http://status.cloudmonitor.ca.com/ and http://stations.status.cloudmonitor.ca.com/ don't offer HTTPS at all (connectio... [23:01:48] jan_drewniak: You're first. [23:02:38] ostriches: sounds good [23:04:17] Hmmm. [23:04:20] zuul backed up [23:04:56] Err. [23:05:07] Hi. 
[23:06:55] !log demon@tin Synchronized portals/: updating to master, removing top-links A/B test (duration: 00m 45s) [23:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:09] jan_drewniak: ^^^ [23:08:00] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Add images.unsplash.com to $wgCopyUploadsDomains (duration: 00m 33s) [23:08:19] ostriches: one more thing, there's a sync-portals script that needs to run after the deploy [23:08:22] Dereckson: ^^^ [23:08:43] jan_drewniak: Where is it? Is this documented? :) [23:09:21] ostriches: it's at the root of the repo. It's only a couple of lines, so not documented [23:09:37] Well, I mean documented as to it needing to be run after deploying :) [23:10:10] Testing. [23:10:19] ostriches: where would I document that? [23:10:55] I dunno.... [23:11:16] !log demon@tin Synchronized php-1.27.0-wmf.21/includes/resourceloader/ResourceLoaderSpecialCharacterDataModule.php: I3e26d08a (duration: 00m 30s) [23:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:23] Krenair: ^^^ [23:11:25] ostriches: works [23:12:21] ostriches: I'll make a note of it on the SWAT schedule next time :) [23:12:26] Ok thx [23:12:31] thanks ostriches , looking [23:12:36] Thanks for the deploy. [23:12:47] !log demon@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 29s) [23:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:08] Those last 2 are from sync-portals. [23:13:17] !log demon@tin Synchronized portals: (no message) (duration: 00m 29s) [23:13:20] Really, the first two lines are already handled by sync-dir. [23:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:31] I'm sure we could add that purge somewhere to be automated. [23:13:37] :) [23:14:08] ostriches: woohoo! 
(works) automating that sounds like a good idea [23:15:06] I'll ask some people in discovery about that [23:15:57] ostriches, looks good [23:16:36] Yay! Swat done. Thanks for playing! [23:16:47] Krenair: Do you know how to kick the gerrit bot? [23:17:23] * Reedy kicks grrrit-wm [23:17:30] YuviPanda has to do it these days IIRC [23:17:38] used to [23:41:45] Any roots about at the moment? [23:54:17] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2208814 (10BBlack) They might also have the option to simply use a hostname within their domains rather than bothering with another name of ours. e.g. configuring it as `wikimedia.status.as... [23:58:19] bblack: You about? I could use a puppet merge.