[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T0000).
[00:00:04] Hauskatze: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:11] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:00:11] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:00:22] my patch is being deployed o/
[00:00:31] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:00:31] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:00:53] hopefully those ain't being caused by our patch thcipriani (the icinga-wm issues just reported I mean)
[00:01:01] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:01:09] Hauskatze: I haven't pulled the trigger yet, so no
[00:01:10] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:01:11] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:01:21] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[00:01:23] mutante: ^^
[00:03:00] !log thcipriani@tin Synchronized php-1.31.0-wmf.22/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameUserJob.php: [[gerrit:414972|LocalRenameUserJob: escape backreferences in replacement title]] T188171 (duration: 01m 13s)
[00:03:00] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:03:10] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:03:10] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:14] T188171: LocalRenameUserJob: escape '$' in replacement title - https://phabricator.wikimedia.org/T188171
[00:03:16] Hauskatze: change is live, FYI
[00:03:51] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:04:10] thcipriani: thanks much, if you think I can borrow few minutes more of your time to do some showJobs.php for https://wikitech.wikimedia.org/wiki/Stuck_global_renames ?
[00:04:51] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:05:29] Hauskatze: showjobs for which wiki?
[00:05:35] enwiki
[00:05:54] wscript showJobs.php --wiki=enwiki --type LocalRenameUserJob
[00:06:00] mwscript showJobs.php --wiki=enwiki --type LocalRenameUserJob
[00:06:02] and
[00:06:08] mwscript showJobs.php --wiki=enwiki --type RenameUserJob
[00:06:10] Puppetdb
[00:06:22] I think this should resolve itself
[00:06:34] As puppetdb should restart itself
[00:06:39] Herron ^^
[00:08:46] (CR) Smalyshev: "LGTM except for check_interval thing" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) (owner: Gehel)
[00:09:00] Hauskatze: both came back 0
[00:09:11] thcipriani: okay, that suits us
[00:10:10] thcipriani: mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki "SimonFoundationContinence" "Drytime%$1600"
[00:10:20] that should make the job run again
[00:10:33] and with our fix, it shouldn't fail again
[00:10:59] (CR) Smalyshev: [C: 1] wdqs: allow configuration of kafka based updates [puppet] - https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: Gehel)
[00:12:38] Hauskatze: gives me "Invalid name"
[00:13:05] the Drytime one?
[00:13:16] sigh
[00:14:06] yes Drytime%$1600
[00:14:17] what a mess
[00:14:46] maybe it's now a problem in the script only
[00:14:53] legoktm will know better for sure
[00:14:58] Hauskatze: oh wait, no, it worked
[00:15:14] yeah, it's running
[00:15:16] <3
[00:15:17] problem was that "$1" was shell expanded.
[00:15:32] because double quotes
[00:16:11] single quotes then better?
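Editor's note: the "Invalid name" result discussed above is consistent with ordinary shell quoting rules. Inside double quotes a POSIX-like shell expands `$1` (the first positional parameter, normally empty in an interactive session), so the script would have received `Drytime%600` instead of the intended name. The sketch below illustrates this under stated assumptions: `show_args` is a hypothetical stand-in that only prints its arguments, not the real maintenance script, and the shell is assumed to be sh or bash with no positional parameters set.

```shell
# Hypothetical stand-in for the maintenance script: prints each argument it receives.
show_args() { printf '%s\n' "$@"; }

# Clear the positional parameters so that $1 is empty, as in a fresh interactive shell.
set --

# Double quotes: the shell expands $1 (empty here) before the command runs,
# and the trailing "600" is left behind as literal text.
double=$(show_args "Drytime%$1600")

# Single quotes: nothing is expanded; the string is passed through unchanged.
single=$(show_args 'Drytime%$1600')

echo "double quotes: $double"
echo "single quotes: $single"
```

Single quotes (or escaping the dollar sign as `\$` inside double quotes) keep the name intact, which matches the suggestion made in the channel.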
[00:16:21] (wikitech docs need fixing then)
[00:16:40] * thcipriani updates
[00:17:22] thcipriani: the rename worked but https://meta.wikimedia.org/wiki/Special:CentralAuth/Drytime%$1600 gives error
[00:17:30] probably that weird username
[00:17:53] I'm not sure I'd bother in fixing that, the username chosen is inappropriate to me
[00:18:03] I'm tempted to revert the rename back
[00:19:00] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:19:29] https://meta.wikimedia.org/wiki/Special:CentralAuth/Drytime%25%241600
[00:21:49] it works now?
[00:22:19] on Chrome and on metawiki, it displays Error 400
[00:23:17] with the &target param works
[00:23:25] also, the account is unattached
[00:24:25] !log OS install on wdqs200[4-6]
[00:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:19] thcipriani: okay so I think it's all done for now. There are some issues --I think-- unrelated to this that ought to be looked into later, but for now I think we're good.
[00:27:41] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (bd808) @jcrespo You flagged this in the last SRE meeting as needing #cloud-services-team help to finish up. Let me know what we can do, and...
[00:27:44] Hauskatze: great! glad to hear it.
[00:28:00] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[00:28:10] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[00:28:10] win 8
[00:29:11] thcipriani: so to summarize: the patch we merged fixed the (Local)RenameJob job so it doesn't break with those weird characters; the script we ran unblocked the rename and demonstrates that the patch merged works as expected; but the rename has left unattached accounts and it seems CentralAuth does not like those weird characters in its search boxes
[00:29:51] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:29:51] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:29:51] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:30:04] those two issues can be fixed, #1 running attachAccount IIRC, but it is not documented anywhere I know so I'll leave legoktm to look into that and #2 I don't know
[00:30:10] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:30:11] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:30:31] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:30:31] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:30:50] I suspect $% somewhat broke the attachment process
[00:31:10] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:31:11] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently
enabled, last run 2 minutes ago with 0 failures
[00:31:20] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:31:21] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:31:26] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4008801 (ayounsi)
[00:33:10] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:41:43] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4008830 (ayounsi) Added the table to the description with current options. Servers are racked in rack 7 of each rows. 1013=A7, 1014=B7, 1015=C7, 1016=D7. The easiest to do...
[00:42:24] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4008831 (ayounsi)
[00:42:47] Krinkle: I'm gonna merge the beta update plugin. Nothing runs it yet so it's harmless, but wanna test in beta.
[00:50:32] (CR) Chad: [C: 2] Beta autoupdate: Clean up, support wmf-config itself [mediawiki-config] - https://gerrit.wikimedia.org/r/414909 (owner: Chad)
[00:51:54] (Merged) jenkins-bot: Beta autoupdate: Clean up, support wmf-config itself [mediawiki-config] - https://gerrit.wikimedia.org/r/414909 (owner: Chad)
[00:52:13] (CR) jenkins-bot: Beta autoupdate: Clean up, support wmf-config itself [mediawiki-config] - https://gerrit.wikimedia.org/r/414909 (owner: Chad)
[00:55:32] !log demon@tin Synchronized scap/plugins/wmfbetaautoupdate.py: no-op (duration: 01m 14s)
[00:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:30] Operations, ops-codfw, Patch-For-Review: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3986247 (Papaul) a:Papaul>Gehel @gehel this is all yours
[01:37:52] no_justification: So the commit changes an existing plugin, but it wasn't yet used/called from anywhere, is that right?
[01:37:59] Yeah
[01:43:19] Krinkle: --remote is weird.
[01:43:21] I don't like it
[01:43:33] no_justification: It uses the tracking branch specified in .gitmodules, right?
[01:43:45] but given we already have Gerrit auto-commit updates for that
[01:43:51] we probably shouldn't use it
[01:43:56] Lemme pastebin
[01:44:40] https://phabricator.wikimedia.org/P6754
[01:44:55] Similar issue on Ext:Widgets
[01:46:14] Yeah, portals is the only one we wanna do --remote on
[01:52:12] (PS1) Chad: wmf-beta-autoupdate: Add --jobs and fix --remote usage [mediawiki-config] - https://gerrit.wikimedia.org/r/415200
[01:52:17] Should do it ^
[01:54:08] (Abandoned) Krinkle: [WIP] profiler: Make entire xhprof-related block conditional on XWD [mediawiki-config] - https://gerrit.wikimedia.org/r/414934 (owner: Krinkle)
[01:56:45] PROBLEM - IPMI Sensor Status on wdqs2006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[01:59:03] (PS1) Krinkle: profiler: Implement 'forceprofile' as part of X-Wikimedia-Debug [mediawiki-config] - https://gerrit.wikimedia.org/r/415201 (https://phabricator.wikimedia.org/T180183)
[02:00:17] Operations, ops-eqiad, Cloud-VPS, cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4008967 (ayounsi) Thanks, asw2-b-eqiad updated accordingly
[02:03:21] (CR) Krinkle: "Downstream issue for WikimediaDebug interface is https://github.com/wikimedia/WikimediaDebug/issues/17, but can happen later." [mediawiki-config] - https://gerrit.wikimedia.org/r/415201 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[02:08:17] hello
[02:09:26] [16:15:14] problem was that "$1" was shell expanded.
<-- I'm starting to be very against $ in usernames [02:09:31] (03PS3) 10BBlack: Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430) [02:09:33] (03PS1) 10BBlack: reload-vcl refactors/improvements [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) [02:09:35] (03PS1) 10BBlack: Make inter-varnish probes great again [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) [02:10:20] (03CR) 10jerkins-bot: [V: 04-1] reload-vcl refactors/improvements [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [02:16:42] Krinkle: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/profiler.php#124 bothers me. I find it bothersome we need to import all of XHGui, pimple, slim and twig to simply dump some profiling data into mongodb [02:16:55] re: multiversion/vendor [02:25:37] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.22) (duration: 06m 21s) [02:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:04] (03PS1) 10Krinkle: profiler-labs: Add CPU and MEMORY flags to XHProf profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415208 [02:29:06] (03PS1) 10Krinkle: profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415209 [02:29:08] (03PS1) 10Krinkle: profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415210 [02:29:10] (03PS1) 10Krinkle: profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183) [02:30:17] (03CR) 10jerkins-bot: [V: 04-1] profiler-labs: Add CPU and MEMORY flags to XHProf profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415208 (owner: 
10Krinkle) [02:30:29] (03CR) 10jerkins-bot: [V: 04-1] profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415209 (owner: 10Krinkle) [02:30:31] (03CR) 10jerkins-bot: [V: 04-1] profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415210 (owner: 10Krinkle) [02:30:38] (03CR) 10jerkins-bot: [V: 04-1] profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [02:32:23] (03PS2) 10Krinkle: profiler-labs: Add CPU and MEMORY flags to XHProf profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415208 [02:32:25] (03PS2) 10Krinkle: profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415209 [02:32:27] (03PS2) 10Krinkle: profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415210 [02:32:29] (03PS2) 10Krinkle: profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183) [02:33:47] (03CR) 10jerkins-bot: [V: 04-1] profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415209 (owner: 10Krinkle) [02:33:51] (03CR) 10jerkins-bot: [V: 04-1] profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415210 (owner: 10Krinkle) [02:33:53] (03CR) 10jerkins-bot: [V: 04-1] profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [02:34:53] (03PS3) 10Krinkle: profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415209 [02:34:56] (03PS3) 10Krinkle: profiler: Remove 
redundant PyBall-XFF check for xwd/profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415210 [02:34:57] (03PS3) 10Krinkle: profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183) [02:38:38] (03PS2) 10BBlack: reload-vcl refactors/improvements [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) [02:38:40] (03PS2) 10BBlack: Make inter-varnish probes great again [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) [02:40:08] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4008993 (10Tgr) [02:42:24] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4008997 (10Tgr) [02:50:06] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4009008 (10Tgr) As @Anomie noted there, {T186965} is fixed (or will be once the patch is merged) for wikis using Remex but not for ones u... [03:00:04] kart_: I, the Bot under the Fountain, allow thee, The Deployer, to do Run preference migration script for Compact Language Links deployment out of Beta in English Wikipedia (T187677). deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T0300). [03:00:04] No GERRIT patches in the queue for this window AFAICS. 
[03:00:05] T187677: Deploy Compact Language Links on the English Wikipedia - https://phabricator.wikimedia.org/T187677 [03:01:15] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1949 bytes in 0.075 second response time [03:01:42] !log Starting CLL preference migration script on terbium (T187677) [03:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:42] I'm monitoring s1, marostegui (and DBAs) let me know if you notice something unusual. [03:06:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1923 bytes in 0.085 second response time [03:23:42] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4009038 (10faidon) I don't think it's easy for anyone to calculate the amount of effort required for this, but the stated 1-2 year long migration sounds longer than I thought and... pretty scary. I'd like to... [03:41:46] 10Operations, 10DNS, 10Traffic: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4009069 (10Dzahn) Is there a specific thing you want to achieve with this move? I can do this but would be nice to have on the ticket a tiny bit of rea... [03:51:40] 10Operations, 10DNS, 10Traffic: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4009072 (10Prtksxna) [03:52:12] 10Operations, 10DNS, 10Traffic: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4004707 (10Prtksxna) >>! In T188362#4009069, @Dzahn wrote: > Is there a specific thing you want to achieve with this move? I can do this but would be n... 
[03:55:17] (03PS1) 10Milimetric: [WIP] Merge requires coordination [puppet] - 10https://gerrit.wikimedia.org/r/415217 (https://phabricator.wikimedia.org/T184759) [03:56:12] (03CR) 10Milimetric: "Take a look at the notes and we can coordinate tomorrow morning. But if you feel confident, go ahead without me." [puppet] - 10https://gerrit.wikimedia.org/r/415217 (https://phabricator.wikimedia.org/T184759) (owner: 10Milimetric) [03:56:47] 10Operations, 10DNS, 10Traffic: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4009079 (10Dzahn) Thank you! I'll take this and it should not be a problem . Regarding the private ticket i don't have permissions to read it yet it se... [04:00:14] (03PS1) 10Imarlier: [WIP] coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [04:03:15] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1944 bytes in 0.099 second response time [04:13:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1922 bytes in 0.115 second response time [04:34:10] (03CR) 10Krinkle: [C: 032] profiler: Implement 'forceprofile' as part of X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415201 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [04:34:20] (03CR) 10Krinkle: [C: 032] profiler-labs: Add CPU and MEMORY flags to XHProf profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415208 (owner: 10Krinkle) [04:35:23] (03Merged) 10jenkins-bot: profiler: Implement 'forceprofile' as part of X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415201 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [04:36:16] (03Merged) 10jenkins-bot: profiler-labs: Add CPU and MEMORY flags to XHProf 
profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415208 (owner: 10Krinkle) [04:36:38] ^ Staging on mwdebug1002 [04:38:02] (03CR) 10jenkins-bot: profiler: Implement 'forceprofile' as part of X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415201 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [04:42:55] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport-private] [04:50:57] (03PS1) 10Krinkle: profiler: Fix typos in e44316351dac9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415221 [04:51:13] (03CR) 10Krinkle: [C: 032] profiler: Fix typos in e44316351dac9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415221 (owner: 10Krinkle) [04:52:26] (03Merged) 10jenkins-bot: profiler: Fix typos in e44316351dac9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415221 (owner: 10Krinkle) [04:55:05] (03CR) 10Krinkle: [C: 04-2] "Pending testing on Beta." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/415209 (owner: 10Krinkle) [04:55:16] !log krinkle@tin Synchronized wmf-config/profiler.php: Iba417de75a and Ied984daecd3f5f6 (duration: 01m 06s) [04:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:21] (03PS16) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [05:24:45] PROBLEM - Disk space on rhenium is CRITICAL: DISK CRITICAL - free space: / 1377 MB (3% inode=96%) [05:28:32] (03PS17) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [05:30:46] (03PS18) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [05:34:34] (03PS19) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [05:43:37] !log demon@tin rebuilt and synchronized wikiversions files: (no justification provided) [05:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:55] (03PS1) 10Chad: group0 back to wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415222 [05:43:57] (03CR) 10Chad: [C: 032] group0 back to wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415222 (owner: 10Chad) [05:45:13] (03Merged) 10jenkins-bot: group0 back to wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415222 (owner: 10Chad) [06:16:46] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Provide a forward port of ICU 52 for stretch / Investigate best ICU update strategy - https://phabricator.wikimedia.org/T177498#4009209 (10Legoktm) [06:19:56] (03PS1) 10Marostegui: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415228 
(https://phabricator.wikimedia.org/T187089) [06:21:32] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415228 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:22:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415228 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:23:11] It looks script run has stopped with an error. Rechecking. [06:24:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 for alter table (duration: 00m 57s) [06:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:10] (03PS1) 10Marostegui: db-eqiad.php: Depool db1060 from api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415230 [06:29:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1060 from api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415230 (owner: 10Marostegui) [06:30:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1060 from api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415230 (owner: 10Marostegui) [06:31:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 for alter table (duration: 00m 57s) [06:31:48] !log (Re)Starting CLL preference migration script on terbium (T187677) [06:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:03] !log Deploy schema change on db1060 (with replication) - this will cause lag on labs servers - T187089 T185128 T153182 [06:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:11] T187677: Deploy Compact Language Links on the English Wikipedia - https://phabricator.wikimedia.org/T187677 [06:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:25] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 
[06:32:25] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [06:32:26] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [06:36:00] marostegui: ^^ not affecting enwiki, right? Just to make sure it won't affect script run for CLL. [06:36:12] No, only s2 [06:36:18] cool. Thanks. [06:36:42] I was hit by T95839 in script run. [06:36:43] T95839: CAS update failed on user_touched for user ID - https://phabricator.wikimedia.org/T95839 [06:48:35] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [06:49:35] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [07:00:35] RECOVERY - Disk space on rhenium is OK: DISK OK [07:07:30] (03PS1) 10Marostegui: db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415232 [07:08:40] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415232 (owner: 10Marostegui) [07:09:28] (03PS2) 10Marostegui: db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415232 [07:11:20] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415232 (owner: 10Marostegui) [07:12:31] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415232 (owner: 10Marostegui) [07:15:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2085 for mariadb and kernel upgrade (duration: 01m 00s) [07:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:56] !log Upgrade kernel and mariadb on db2085 [07:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:47] (03PS1) 
10Marostegui: Revert "db-codfw.php: Depool db2085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415233 [07:24:56] RECOVERY - Long running screen/tmux on eventlog1001 is OK: OK: No SCREEN or tmux processes detected. [07:26:37] (03CR) 10Elukey: [WIP] eventlogging: add systemd support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413362 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [07:33:03] (03CR) 10Marostegui: mariadb: Set up es2001 as the temporary backup target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [07:33:27] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415233 (owner: 10Marostegui) [07:34:56] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415233 (owner: 10Marostegui) [07:36:17] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2085 (duration: 00m 57s) [07:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:37] (03PS1) 10Marostegui: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415235 (https://phabricator.wikimedia.org/T162807) [07:48:37] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415235 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:49:49] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415235 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:51:10] !log Reboot db2062 for mariadb and kernel upgrade [07:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:34] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2062 - T162807 (duration: 00m 57s) [07:51:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:47] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [07:52:46] (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413362 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [07:55:42] (03PS21) 10Elukey: eventlogging: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/413362 (https://phabricator.wikimedia.org/T114199) [08:07:00] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994382 (10Joe) >>! In T188045#4007098, @Smalyshev wrote: > I wonder if it's possible to use one of the new servers we're getting in T187766 to restore full capacity if debugging wh... [08:07:25] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100% [08:07:46] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:06] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:06] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:16] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:08:25] PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:26] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100% [08:08:26] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:35] PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100% [08:08:35] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:36] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:36] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100% [08:08:55] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:09:43] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - 
https://phabricator.wikimedia.org/T188045#4009338 (10Smalyshev) > If losing one server out of 4 is an issue One out of 3. We don't have a problem //right now//, but if we will have to take down another one - for maintenanc... [08:10:58] !log rebooting remaining mediawiki API servers in eqiad [08:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:46] had a look over mgmt, #ganeti1006 is oom-killing kvm processes, should recover soonish [08:14:45] PROBLEM - ganeti-noded running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded [08:14:55] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [08:15:05] PROBLEM - ganeti-mond running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [08:15:45] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [08:15:45] RECOVERY - ganeti-noded running on ganeti1006 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [08:15:55] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [08:15:55] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [08:15:55] RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [08:15:55] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [08:16:05] RECOVERY - ganeti-mond running on ganeti1006 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond [08:16:05] RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [08:16:15] RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 64%, RTA = 0.57 ms [08:16:15] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [08:16:15] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [08:16:25] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms 
[08:16:35] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [08:17:55] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [08:18:35] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:19:05] PROBLEM - ElasticSearch health check for shards on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.37:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.0.37, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f40dfde5b90: Failed to establish a new connection: [Errno 111] Connection ref [08:19:11] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415238 [08:19:35] (03CR) 10jenkins-bot: profiler-labs: Add CPU and MEMORY flags to XHProf profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415208 (owner: 10Krinkle) [08:19:39] (03CR) 10jenkins-bot: profiler: Fix typos in e44316351dac9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415221 (owner: 10Krinkle) [08:19:44] (03CR) 10jenkins-bot: group0 back to wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415222 (owner: 10Chad) [08:19:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415228 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [08:19:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1060 from api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415230 (owner: 10Marostegui) [08:19:58] (03CR) 10jenkins-bot: db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415232 (owner: 10Marostegui) [08:20:02] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415233 (owner: 10Marostegui) [08:20:07] 
(03CR) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415235 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:20:57] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415238 (owner: 10Marostegui) [08:22:13] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415238 (owner: 10Marostegui) [08:23:22] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2062 - T162807 (duration: 00m 57s) [08:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:36] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [08:24:41] (03CR) 10Vgutierrez: [C: 032] Provide BGP session state visibility for every ASN/peer [debs/pybal] - 10https://gerrit.wikimedia.org/r/414973 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [08:25:09] (03Merged) 10jenkins-bot: Provide BGP session state visibility for every ASN/peer [debs/pybal] - 10https://gerrit.wikimedia.org/r/414973 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [08:29:18] (03PS1) 10Marostegui: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415241 (https://phabricator.wikimedia.org/T162807) [08:31:15] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415241 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:32:25] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415241 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:33:40] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2069 - T162807 (duration: 00m 57s) [08:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:53] T162807: 
Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [08:34:37] (03CR) 10Joal: "Comment inline. The plan is good for me." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415217 (https://phabricator.wikimedia.org/T184759) (owner: 10Milimetric) [08:34:53] !log Reboot db2069 for kernel upgrade [08:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:16] I'm checking logstash1007, unhappy after ganeti1006 reboot [08:38:45] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational [08:38:46] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415238 (owner: 10Marostegui) [08:38:50] (03CR) 10jenkins-bot: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415241 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:38:52] nevermind [08:39:04] running puppet "fixed" it [08:39:36] or maybe no [08:40:34] java.lang.IllegalArgumentException: unknown setting [ltr.caches.max_mem] [08:40:37] gehel: ^ [08:40:57] I remember seeing some ltr reviews yesterday perhaps [08:41:18] ganeti1006 didn't reboot BTW, it only OOM-killed all the kvm instances it was running [08:41:45] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:42:12] !log filippo@neodymium conftool action : set/pooled=no; selector: name=neodymium.eqiad.wmnet [08:42:16] godog: thanks, checking... [08:42:21] !log filippo@neodymium conftool action : set/pooled=yes; selector: name=neodymium.eqiad.wmnet [08:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:31] oops (neodymium) [08:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:51] gehel: ok! I was going to depool logstash1007, I can proceed with it [08:43:09] <_joe_> moritzm: all of them? 
wow [08:43:35] (03CR) 10Giuseppe Lavagetto: conftool: add json-schemas for MediaWiki variables validation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415046 (https://phabricator.wikimedia.org/T185080) (owner: 10Giuseppe Lavagetto) [08:43:38] moritzm: ah! [08:43:48] ok, I know what the issue is, fix coming up... [08:44:01] did all the logstash servers restart? [08:44:22] only 1007 afaik [08:44:26] ok, that explains... [08:44:44] there is a config setting that should only be on the cirrus cluster, not on logstash [08:47:41] (03PS1) 10Gehel: logstash: removing LTR plugin configuration [puppet] - 10https://gerrit.wikimedia.org/r/415242 [08:47:59] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415243 [08:48:18] godog: ^ [08:49:12] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4009389 (10elukey) >>! In T188294#4006219, @Ottomata wrote: >> We still haven't tested Hadoop packages on stretch > > We kinda have, just not s... [08:49:16] (03CR) 10Filippo Giunchedi: [C: 031] logstash: removing LTR plugin configuration [puppet] - 10https://gerrit.wikimedia.org/r/415242 (owner: 10Gehel) [08:49:34] gehel: looks good! [08:49:50] godog: thanks! Just running puppet compiler, and merging... [08:50:15] ACKNOWLEDGEMENT - DPKG on restbase-dev1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Muehlenhoff Caused by broken disk, see T185494 [08:50:15] ACKNOWLEDGEMENT - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Muehlenhoff Caused by broken disk, see T185494 [08:50:25] I forgot that elasticsearch is very picky about its config and actually fails if it sees an unknown setting... [08:52:21] wat? puppet compiler tells me there is no change...
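The failure discussed above (a service refusing to start because its config contains a key it does not recognise) can be sketched roughly as follows. This is a minimal illustration only, not Elasticsearch or puppet code: `KNOWN_SETTINGS`, `validate_settings`, `render_settings`, and the `"100mb"` value are all hypothetical names and placeholders chosen for the example. The only detail taken from the log is the setting name `ltr.caches.max_mem` and the wording of the error.

```python
# Hypothetical sketch: strict config validation plus an opt-in plugin setting.
# None of these names are real Elasticsearch or puppet APIs.

# Settings this (imaginary) cluster understands; the LTR plugin is not installed here.
KNOWN_SETTINGS = {"cluster.name", "node.name", "network.host"}


def validate_settings(settings, known=KNOWN_SETTINGS):
    """Reject the whole config if any key is not recognised."""
    unknown = sorted(set(settings) - set(known))
    if unknown:
        # Mirrors the shape of the error quoted in the log.
        raise ValueError("unknown setting [%s]" % unknown[0])
    return dict(settings)


def render_settings(base, ltr_enabled=False):
    """Emit the plugin key only when a cluster explicitly asks for it."""
    settings = dict(base)
    if ltr_enabled:
        settings["ltr.caches.max_mem"] = "100mb"  # placeholder value
    return settings


if __name__ == "__main__":
    base = {"cluster.name": "example-logstash"}
    print(validate_settings(render_settings(base)))
    try:
        validate_settings(render_settings(base, ltr_enabled=True))
    except ValueError as err:
        print("startup would fail:", err)
```

With the flag defaulting to off, a cluster without the plugin never receives the key, so strict validation has nothing to reject.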
[08:52:43] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415243 (owner: 10Marostegui) [08:53:54] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415243 (owner: 10Marostegui) [08:55:00] (03PS2) 10Gehel: logstash: removing LTR plugin configuration [puppet] - 10https://gerrit.wikimedia.org/r/415242 [08:55:17] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2069 - T162807 (duration: 00m 57s) [08:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:32] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [08:55:46] (03PS1) 10Filippo Giunchedi: aptrepo: add puppetdb4 component [puppet] - 10https://gerrit.wikimedia.org/r/415244 (https://phabricator.wikimedia.org/T184562) [08:55:55] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational [08:58:18] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415243 (owner: 10Marostegui) [08:58:43] (03CR) 10Gehel: [C: 032] "puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler02/10166/" [puppet] - 10https://gerrit.wikimedia.org/r/415242 (owner: 10Gehel) [08:58:50] (03CR) 10Muehlenhoff: [C: 031] "Looks good. IS this self-built or coming from the puppetlabs repo?" [puppet] - 10https://gerrit.wikimedia.org/r/415244 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [08:58:55] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:59:55] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational [09:00:25] RECOVERY - ElasticSearch health check for shards on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 55, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active [09:00:25] alizing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [09:01:52] (03PS1) 10Marostegui: db-codfw.php: Depool db208{3-1} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415245 [09:04:09] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db208{3-1} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415245 (owner: 10Marostegui) [09:04:15] ACKNOWLEDGEMENT - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport-private] Giuseppe Lavagetto T185970 [09:05:23] (03Merged) 10jenkins-bot: db-codfw.php: Depool db208{3-1} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415245 (owner: 10Marostegui) [09:05:35] 10Operations, 10Analytics, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4009435 (10elukey) >>! In T188377#4006258, @Ottomata wrote: > stars: https://github.com/wikimedia/puppet-zookeeper/stargazers > watchers: https://github.com/w... 
[09:06:33] !log Reboot db2083, db2082 and db2081 for kernel and mariadb upgrade [09:06:41] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2083, db2082 and db2081 for kernel upgrade (duration: 00m 56s) [09:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:29] (03CR) 10jenkins-bot: db-codfw.php: Depool db208{3-1} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415245 (owner: 10Marostegui) [09:08:33] gehel: \o/ [09:09:05] godog: yeah, that was a stupid mistake on my side. Thanks for catching it up before it went horribly wrong! [09:09:30] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/415244 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [09:10:04] <_joe_> win 19 [09:10:07] <_joe_> meh [09:10:35] gehel: np! but yeah if elasticsearch fails on unknown config the features should be opt-in in puppet not opt-out IMHO [09:10:58] godog: yep, I corrected that in the second patch [09:11:23] ah! 
even better [09:12:28] (03PS1) 10Ema: cache_text: upgrade codfw to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/415246 (https://phabricator.wikimedia.org/T184448) [09:16:26] (03CR) 10Muehlenhoff: "Yeah, let's use systemd::service_unit" [puppet] - 10https://gerrit.wikimedia.org/r/413362 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [09:18:39] (03CR) 10Ema: [C: 032] cache_text: upgrade codfw to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/415246 (https://phabricator.wikimedia.org/T184448) (owner: 10Ema) [09:22:03] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db208{3-1}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415247 [09:25:03] !log upgrade cache_text@codfw to varnish 5 [09:25:10] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db208{3-1}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415247 (owner: 10Marostegui) [09:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:08] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db208{3-1}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415247 (owner: 10Marostegui) [09:27:25] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [09:27:35] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2083, db2082 and db2081 after kernel upgrade (duration: 00m 57s) [09:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:26] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:36:39] (03CR) 10Jcrespo: "Thanks." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [09:40:40] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db208{3-1}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415247 (owner: 10Marostegui) [09:40:57] (03PS7) 10Jcrespo: mariadb: Set up es2001 as the temporary backup target [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) [09:51:51] (03CR) 10Marostegui: [C: 031] mariadb: Set up es2001 as the temporary backup target [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [09:57:16] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4009514 (10Gehel) A few things to check (thanks for the pointers from my fellow ops): * is another server stealing its IP ** check DNS -> nothing suspicious ** check DHCP leases **... [09:58:46] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.15 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/415006 (https://phabricator.wikimedia.org/T187822) (owner: 10Gilles) [09:59:14] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4009517 (10Gehel) a:05Gehel>03Cmjohnson @Cmjohnson this looks like an issue with the physical connection. Could you try moving the cable to another port on the switch so that we... 
[10:07:17] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active, AS2914/IPv4: Active [10:07:36] (03PS1) 10Elukey: Allow a kafkatee instance to be configured with no output [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/415253 [10:08:51] (03PS3) 10Giuseppe Lavagetto: conftool: add json-schemas for MediaWiki variables validation [puppet] - 10https://gerrit.wikimedia.org/r/415046 (https://phabricator.wikimedia.org/T185080) [10:09:14] uh, what does that bgp alert mean? 'AS2914/IPv6: Active, AS2914/IPv4: Active' seems like good news? [10:09:59] probably librenms can give us a better insight, that alarm is sometimes confusing [10:11:34] (03PS2) 10Elukey: Allow a kafkatee instance to be configured with no output [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/415253 [10:11:39] elukey: yeah good point [10:11:53] BGP Session Down: 129.250.204.5 (AS2914) [10:11:59] BGP Session Down: 2001:418:0:5000::6fa (AS2914) [10:12:48] oh, and that AS is NTT (scheduled maintenance in progress) [10:13:21] yep! 
all good then :) [10:13:51] CC: XioNoX ^ [10:14:59] thanks [10:15:42] We have several providers, if 1 goes down, usually not a big deal (unless it flaps), but if more, it can be an issue [10:16:02] (03CR) 10Elukey: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10168/" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/415253 (owner: 10Elukey) [10:16:18] also FYI: https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status [10:16:31] ema: ^ [10:18:06] XioNoX: thanks, I keep on forgetting about that document [10:18:40] I know that up to date doc is surprising at wiki :) [10:19:41] I can't help but hear your voice while reading that [10:20:39] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/415046 (https://phabricator.wikimedia.org/T185080) (owner: 10Giuseppe Lavagetto) [10:21:27] (03PS1) 10Elukey: role::netinsight: remove output include directives from config [puppet] - 10https://gerrit.wikimedia.org/r/415254 [10:24:01] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10169/" [puppet] - 10https://gerrit.wikimedia.org/r/415254 (owner: 10Elukey) [10:26:07] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [10:26:20] \o/ [10:39:20] (03CR) 10Volans: [C: 04-1] "Thanks for the fixes! LGTM, I have a small half-nitpick comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415047 (owner: 10Ema) [10:40:55] (03PS8) 10Jcrespo: mariadb: Set up es2001 as the temporary backup target [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) [10:41:22] (03PS1) 10Gilles: Distribution information isn’t available during debian package build [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/415256 (https://phabricator.wikimedia.org/T187350) [10:41:29] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Set up es2001 as the temporary backup target [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [10:43:12] (03PS9) 10Jcrespo: mariadb: Set up es2001 as the temporary backup target [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) [10:43:28] !log rebooting remaining mediawiki app servers in eqiad [10:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:59] (03CR) 10Filippo Giunchedi: [C: 032] Distribution information isn’t available during debian package build [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/415256 (https://phabricator.wikimedia.org/T187350) (owner: 10Gilles) [10:44:22] (03CR) 10Jcrespo: "> Patch Set 7: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [10:45:43] (03PS4) 10Giuseppe Lavagetto: conftool: add json-schemas for MediaWiki variables validation [puppet] - 10https://gerrit.wikimedia.org/r/415046 (https://phabricator.wikimedia.org/T185080) [10:45:50] (03CR) 10Marostegui: "> > Patch Set 7: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [10:46:27] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add json-schemas for MediaWiki variables validation [puppet] - 10https://gerrit.wikimedia.org/r/415046 
(https://phabricator.wikimedia.org/T185080) (owner: 10Giuseppe Lavagetto) [10:50:23] (03PS1) 10Marostegui: tendril.my.cnf.erb: Disable binlog [puppet] - 10https://gerrit.wikimedia.org/r/415257 (https://phabricator.wikimedia.org/T184704) [10:54:24] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/10171/" [puppet] - 10https://gerrit.wikimedia.org/r/415257 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [10:54:34] !log draining restbase2001 for eventual reboot for kernel security update [10:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:31] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet [10:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:53] !log upload python-thumbor-wikimedia 1.15 - T187822 T187350 [11:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:08] T187350: Add thumbor and thumbor-plugins versions to log entries/errors - https://phabricator.wikimedia.org/T187350 [11:00:08] T187822: Have Thumbor use a different Swift user when dealing with private containers - https://phabricator.wikimedia.org/T187822 [11:01:28] 10Operations, 10Ops-Access-Requests: reinstate ezachte's access - https://phabricator.wikimedia.org/T188335#4009668 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Erik's new key has been added. [11:02:01] 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4009676 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:02:12] 10Operations, 10monitoring: Upgrade to Prometheus 2.x - https://phabricator.wikimedia.org/T187987#4009677 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:05:01] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4009688 (10Marostegui) a:03Papaul This host is failing almost everyday (the same slot). 
So I am starting to believe it is the controller and not the disks anymore. [11:05:52] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4002363 (10jcrespo) @Papaul what disks are you using as replacement? [11:06:17] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 90, down: 0, shutdown: 2 [11:06:53] (03CR) 10Jcrespo: [C: 031] tendril.my.cnf.erb: Disable binlog [puppet] - 10https://gerrit.wikimedia.org/r/415257 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [11:07:07] (03PS2) 10Marostegui: tendril.my.cnf.erb: Disable binlog [puppet] - 10https://gerrit.wikimedia.org/r/415257 (https://phabricator.wikimedia.org/T184704) [11:07:53] (03CR) 10Marostegui: [C: 032] tendril.my.cnf.erb: Disable binlog [puppet] - 10https://gerrit.wikimedia.org/r/415257 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [11:10:00] !log rollout thumbor 1.15 to codfw/eqiad [11:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:18] Hi! Please deploy MediaWiki 1.31/wmf.23 in group 1, see old version in https://www.mediawiki.org/wiki/Special:Version. [11:14:38] (03PS1) 10Vgutierrez: pybal: icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) [11:15:06] (03CR) 10jerkins-bot: [V: 04-1] pybal: icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [11:16:01] joaquinito01_ as I said on the other channel, if you need help #wikimedia-tech is probably the right place [11:16:03] How to deploy a MediaWiki 1.31/wmf.23 in group 1. [11:16:06] ? [11:16:12] joaquinito01_, that's not operations territory [11:16:24] joaquinito01_: How? You don't and wait. [11:16:33] How to deploy a MediaWiki 1.31/wmf.23 in group 1? [11:16:46] joaquinito01_, okay, so you are a bot. 
[11:16:53] I was thinking that [11:16:58] Not bot. [11:17:00] How to deploy a MediaWiki 1.31/wmf.23 in group 1? [11:17:08] but didn't want to guess on my first 2 interactions [11:17:20] joaquinito01_, yeah bot [11:18:10] Not bot. [11:18:14] How to deploy a MediaWiki 1.31/wmf.23 in group 1? [11:18:20] (03PS2) 10Vgutierrez: pybal: icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) [11:18:27] joaquinito01_, yeah bot [11:18:50] (03CR) 10jerkins-bot: [V: 04-1] pybal: icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [11:18:54] !log powercycling restbase2001, stuck in reboot [11:19:02] same bot as on Oct 25 2017, actually. [11:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:19] (03PS1) 10Giuseppe Lavagetto: conftool: use JSON for json-schema files [puppet] - 10https://gerrit.wikimedia.org/r/415261 [11:20:40] (03PS3) 10Vgutierrez: pybal: icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) [11:21:10] And in group 0. [11:21:37] (03PS2) 10Giuseppe Lavagetto: conftool: use JSON for json-schema files [puppet] - 10https://gerrit.wikimedia.org/r/415261 [11:21:52] <_joe_> andre__: should I act? [11:22:26] _joe_, feel free to. Same messages from same IP range (78.30.*) on #wikimedia-cloud on Oct 25, 2017, just for the records :) [11:22:32] * andre__ shrugs [11:23:05] thanks [11:24:30] <_joe_> before someone feels I'm abusing my power by not deopping immediately... [11:24:34] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: use JSON for json-schema files [puppet] - 10https://gerrit.wikimedia.org/r/415261 (owner: 10Giuseppe Lavagetto) [11:26:08] (03CR) 10Volans: "I've just done a quick pass." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [11:35:33] !log rebooting eqiad job runners for kernel security update [11:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:35] (03CR) 10Gilles: "@bblack have you already scheduled to deploy this?" [puppet] - 10https://gerrit.wikimedia.org/r/413185 (https://phabricator.wikimedia.org/T187899) (owner: 10Gilles) [11:37:09] !log Reset slave all on db2093 - T184704 [11:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:23] T184704: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704 [11:39:12] andre__: ah, so this is the bot you 've been talking about. [11:39:18] interesting [11:43:48] (03PS1) 10Filippo Giunchedi: hieradata: add private wikis thumbor swift user [puppet] - 10https://gerrit.wikimedia.org/r/415263 (https://phabricator.wikimedia.org/T187822) [11:47:59] (03PS3) 10Ema: wmf-upgrade-and-reboot: upgrade the given host and reboot it [puppet] - 10https://gerrit.wikimedia.org/r/415047 [11:48:26] (03CR) 10Ema: wmf-upgrade-and-reboot: upgrade the given host and reboot it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415047 (owner: 10Ema) [11:48:31] (03CR) 10jerkins-bot: [V: 04-1] wmf-upgrade-and-reboot: upgrade the given host and reboot it [puppet] - 10https://gerrit.wikimedia.org/r/415047 (owner: 10Ema) [11:49:40] !log draining restbase2002 for eventual reboot for kernel security update [11:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:45] (03PS4) 10Ema: wmf-upgrade-and-reboot: upgrade the given host and reboot it [puppet] - 10https://gerrit.wikimedia.org/r/415047 [11:55:44] (03PS1) 10Ladsgroup: Reduce the batch size of statment usage tracking to 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415264 
(https://phabricator.wikimedia.org/T151717) [11:56:59] (03CR) 10jerkins-bot: [V: 04-1] Reduce the batch size of statment usage tracking to 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415264 (https://phabricator.wikimedia.org/T151717) (owner: 10Ladsgroup) [11:59:43] (03CR) 10Ladsgroup: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415264 (https://phabricator.wikimedia.org/T151717) (owner: 10Ladsgroup) [12:00:56] !log Reboot db1115 tendril master to pick up new my.cnf options - T184704 [12:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:10] T184704: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704 [12:01:30] (03CR) 10jerkins-bot: [V: 04-1] Reduce the batch size of statment usage tracking to 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415264 (https://phabricator.wikimedia.org/T151717) (owner: 10Ladsgroup) [12:02:24] marostegui: If CLL preference script is running, is it likely cause issue while SWAT is going on? It looks it won't finish in time. [12:02:36] jynus: ^ [12:02:46] kart_: Sorry, I have no context of what you are talking about :-) [12:02:53] kart_: Can you elaborate a bit? [12:03:21] marostegui: https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_February_28 See current deployment. [12:03:48] marostegui: running on terbium as normally we do for updating prefs. [12:03:55] (03CR) 10Gilles: [C: 031] hieradata: add private wikis thumbor swift user [puppet] - 10https://gerrit.wikimedia.org/r/415263 (https://phabricator.wikimedia.org/T187822) (owner: 10Filippo Giunchedi) [12:03:56] kart_: Ah right, that one. 
[12:04:34] kart_: I don't really have any problems with it running out of its reserved window, you might need to ask the swat deployers: hashar or zeljkof typically during this time of the day [12:05:05] (03PS3) 10Gilles: Add Thumbor private container user configuration keys [puppet] - 10https://gerrit.wikimedia.org/r/414631 (https://phabricator.wikimedia.org/T187822) [12:05:07] Yep. I scheduled deployment. will adjust it and ask before SWAT starts. [12:05:25] otherwise, will stop and continue in free slot. [12:05:30] marostegui, kart_: if it does not cause any problems, I am fine with it running during swat ;) [12:05:44] (03PS1) 10Jcrespo: Depool labsdb1011 to copy its data away [puppet] - 10https://gerrit.wikimedia.org/r/415265 (https://phabricator.wikimedia.org/T186579) [12:05:53] zeljkof: unlikely, but just to make sure. [12:06:16] kart_: I'll keep that in mind, and feel free to remind me before swat [12:07:11] kart_: the main issue is version changing [12:07:18] Thanks. will do that. [12:07:21] if the script can stop and start [12:07:33] it would be nice to do that after a train deploy [12:07:36] jynus: during train? [12:07:41] jynus: right, right. [12:07:46] so we are not running code with old codebase [12:07:50] if that makes sense [12:08:09] jynus: I'm running on wmf.22, we're not on wmf.23 on any wiki yet? (/me checks again) [12:08:26] you are right - my ping was if it takes a day [12:08:29] jynus: I'll make sure it won't interrupt the train at all. [12:08:32] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4009851 (10Marostegui) [12:08:36] and in the middle enwiki changes version [12:08:42] yes. [12:08:44] not sure when it has been scheduled [12:09:19] swat in theory should not affect it, but it would be nice to check from time to time metrics and output [12:09:39] e.g. start running it at the start of your day, etc.
[12:09:44] I'll watch s1 [12:10:03] I think it touched preferences [12:10:09] which doesn't have any scheduled alter [12:10:14] that is the main issue for us [12:11:17] I tried scheduling as soon as possible I can (8.30 AM), let's see how it goes. [12:14:11] kart_: I am fine having change deployed earlier than the European SWAT :] [12:14:29] and I will be happy to assist in the deployment whenever you need [12:15:02] (hey we could even create an "India SWAT" slot :D ) [12:15:34] Someday :) [12:15:38] yeah, it is about time tim has to handle mediawiki fires! [12:15:44] :-) [12:16:07] hashar: thanks. issue is we need to finish script, and then only deploy the config. [12:16:22] one last thing [12:16:35] that I was surprised when I talked to other devels [12:16:43] kart_: you know screen, right? [12:16:49] jynus: yep [12:16:52] ok [12:16:57] jynus: using it. [12:17:13] other people looked me strange when I suggested it, just checking ;-) [12:17:18] heh [12:17:33] running long script without screen is big NO. [12:17:44] (I personally use tmux, wherever I can) [12:17:49] yeah, whatever [12:19:05] it is just that playing whacamole on terbium with so many .php processes is not easy [12:19:49] (03PS3) 10Mark Bergsma: Add unit tests for Coordinator methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/406478 [12:21:46] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4009907 (10Marostegui) We have db1113 as db1114 as spare btw. 
(they are large servers) [12:26:12] kart_: we have some global renames running right now, jynus suggested that I talk to you [12:26:23] although I'm not sure why [12:26:47] (CR) Marostegui: [C: 1] Depool labsdb1011 to copy its data away [puppet] - https://gerrit.wikimedia.org/r/415265 (https://phabricator.wikimedia.org/T186579) (owner: Jcrespo) [12:26:59] yeah, I would prefer that massive renames that could touch user preferences not run while the maintenance script is ongoing [12:27:47] for the moment it's just the once [12:28:05] but we can disable global renaming altogether for a while if you need to [12:28:11] it's happened in the past [12:28:38] just coordinate on when it is being run, I don't think it is a hard requirement [12:28:53] kart_ may know the timestamps [12:30:46] Hauskatze: hi [12:31:15] Hauskatze: is it already started? [12:31:46] kart_: that big +100 000 edits yeah :( [12:31:53] Hauskatze: and how long will it take? [12:31:54] guy forgot to notify Phabricator [12:32:02] Operations, DBA, Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009933 (Marostegui) [12:32:07] Hauskatze: Oops. Also, not in Deployment page? [12:32:15] with the number of wikis and the editcount, no less than one hour and a half [12:32:17] (don't know if it needs to be) [12:32:25] if nothing breaks in the interim [12:32:40] (nb: I didn't do it, in fact I was declining the request) [12:33:13] kart_: global renames do not classically need their own window [12:33:20] Hauskatze: CLL preference script is going on right now, will likely continue for a couple more hours. [12:33:24] just need devel attention if they break [12:33:33] jynus: I see. I was unaware of it.
[12:33:46] or queing them, so we do not run them in parallel, causing load issues [12:33:48] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415271 (https://phabricator.wikimedia.org/T187886) [12:33:58] this is only until we "fix" renames so they are instant [12:34:16] Hauskatze: OK. Let me know if something breaks. Or I'll ping you if something breaks from my side. [12:34:39] kart_: okay, but I don't have db access so I can't do much, but that's okay [12:34:59] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad,db-codfw.php: Remove db2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415271 (https://phabricator.wikimedia.org/T187886) (owner: 10Marostegui) [12:35:07] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009956 (10Marostegui) [12:35:19] !log draining restbase2003 for eventual reboot for kernel security update [12:35:28] Hauskatze: can I read more about how it is running, which script etc? [12:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:50] kart_: it is a mediawiki admin command [12:35:50] kart_: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Glorious_Engine [12:35:52] hashar: CI issues? https://integration.wikimedia.org/ci/job/operations-mw-config-php55lint/19307/console [12:36:10] I guess it does a lot of UPDATE $this to $that [12:36:26] not a script, we just ask to communicate when massive ones so they don't interfere with maintenance [12:36:29] it uses Extension:CentralAuth and Extension:RenameUser [12:36:51] (03PS1) 10Marostegui: install_server: Remove db2011 [puppet] - 10https://gerrit.wikimedia.org/r/415274 (https://phabricator.wikimedia.org/T187886) [12:36:56] so my point was to make you aware of that [12:37:06] jynus: Thanks. [12:37:13] Hauskatze: Thanks for pointers. 
[12:37:27] marostegui: yeah :( [12:37:32] :( [12:37:56] (03CR) 10Marostegui: [C: 032] install_server: Remove db2011 [puppet] - 10https://gerrit.wikimedia.org/r/415274 (https://phabricator.wikimedia.org/T187886) (owner: 10Marostegui) [12:38:15] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415271 (https://phabricator.wikimedia.org/T187886) (owner: 10Marostegui) [12:39:13] gerrit issues, or just spurious error? [12:39:18] marostegui: that php55lint job has a few issues. But I will eventually get rid of it in favor of running "composer test" [12:39:27] :-) [12:39:30] the job workspace is kept between builds [12:39:35] idwiki and wikidatawiki are the ones with the most edits [12:39:39] yeah, now it worked [12:39:39] and it had a left over .git/config.lock file from a previous build :( [12:39:41] enwiki just have 900+ [12:39:53] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Remove db2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415271 (https://phabricator.wikimedia.org/T187886) (owner: 10Marostegui) [12:40:07] Hauskatze: it is ok, this was more a ping for kart_ [12:40:15] (so in short: php55lint mis behaving is a known issue, fix will happen eventually) [12:40:16] no action needed so far [12:40:21] thanks for the report [12:41:07] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415271 (https://phabricator.wikimedia.org/T187886) (owner: 10Marostegui) [12:41:21] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415271 (https://phabricator.wikimedia.org/T187886) (owner: 10Marostegui) [12:42:43] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db2011 - T187886 (duration: 00m 58s) [12:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:59] T187886: Decommission db2011 - https://phabricator.wikimedia.org/T187886 [12:43:48] 10Operations, 
10ops-codfw, 10DBA, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4009996 (10Marostegui) [12:44:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db2011 - T187886 (duration: 00m 59s) [12:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:32] testing glance alerts so one will be expected in the next few minutes [12:44:33] zeljkof: I rescheduled my config change to next SWAT. Will continue script, unless we notice any issue. [12:44:42] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4010000 (10Marostegui) a:03RobH All the DBA steps are done. Assigning it to @robh so this can continue Thanks! [12:45:51] (03PS1) 10Ema: cache_text: upgrade eqiad to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/415275 (https://phabricator.wikimedia.org/T184448) [12:46:11] PROBLEM - glance-api http on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 9292: Connection refused [12:46:17] ^me [12:46:35] kart_: /me thumbs up 👍 [12:48:52] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#4010016 (10Addshore) 05stalled>03declined Going to mark this as declined, the hackey script that I put in place has been working for over 2 years. Maybe one day we will ge... 
[12:49:11] RECOVERY - glance-api http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 817 bytes in 0.076 second response time [12:49:56] Operations, ops-eqiad: Failed power supply redundancy on wdqs1006 - https://phabricator.wikimedia.org/T188501#4010020 (MoritzMuehlenhoff) [12:50:04] Operations, ops-eqiad: Failed power supply redundancy on wdqs1006 - https://phabricator.wikimedia.org/T188501#4010031 (MoritzMuehlenhoff) p:Triage>Normal [12:50:11] ^going once more [12:51:02] ACKNOWLEDGEMENT - IPMI Sensor Status on wdqs2006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Muehlenhoff T188501 [12:56:26] Operations, Wikimedia-Logstash: logstash group1 dashboard incorrectly shows testwikidatawiki - https://phabricator.wikimedia.org/T184655#4010061 (Addshore) Open>stalled I imagine this task is pretty easy for someone with access / know-how. [12:57:12] PROBLEM - glance-api http on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 9292: Connection refused [12:57:58] ^ me again [13:01:12] RECOVERY - glance-api http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 817 bytes in 0.081 second response time [13:03:03] Operations, Electron-PDFs, TCB-Team, Patch-For-Review, and 3 others: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#4010082 (Addshore) Open>Resolved a:Addshore As far as I can tell this is all done [13:11:40] (PS7) Gehel: wdqs: icinga check for categories updates [puppet] - https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) [13:13:09] !log draining restbase2004 for eventual reboot for kernel security update [13:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:24] (PS14) Gehel: wdqs: allow configuration of kafka based updates [puppet] -
10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [13:23:27] (03PS2) 10Jcrespo: Depool labsdb1011 to copy its data away [puppet] - 10https://gerrit.wikimedia.org/r/415265 (https://phabricator.wikimedia.org/T186579) [13:24:31] (03CR) 10Jcrespo: [C: 032] Depool labsdb1011 to copy its data away [puppet] - 10https://gerrit.wikimedia.org/r/415265 (https://phabricator.wikimedia.org/T186579) (owner: 10Jcrespo) [13:28:32] (03PS1) 10Rush: rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 [13:28:38] (03CR) 10Gehel: [C: 032] "Puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler02/10174/" [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [13:28:49] (03PS15) 10Gehel: wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [13:28:57] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 (owner: 10Rush) [13:32:26] 10Operations, 10ops-eqiad: Failed power supply redundancy on wdqs1006 - https://phabricator.wikimedia.org/T188501#4010203 (10Gehel) Note: this is one of the new wdqs servers, not in service yet. [13:33:03] (03PS4) 10Mark Bergsma: Add unit tests for Coordinator methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/406478 [13:35:18] hey, zeljkof are you SWATting today? [13:38:08] 10Operations, 10ops-eqiad: Failed power supply redundancy on wdqs1006 - https://phabricator.wikimedia.org/T188501#4010219 (10Gehel) a:05Cmjohnson>03Papaul Strange... according to T188432 the new wdqs servers are wdqs100[7-9]. The current wdqs cluster in eqiad is [[ https://github.com/wikimedia/puppet/blob/... 
[13:38:28] Operations, ops-eqiad: Failed power supply redundancy on wdqs2006 - https://phabricator.wikimedia.org/T188501#4010229 (Gehel) [13:38:31] Operations, ops-eqiad: Failed power supply redundancy on wdqs2006 - https://phabricator.wikimedia.org/T188501#4010020 (Gehel) [13:46:22] PROBLEM - Check size of conntrack table on mw1308 is CRITICAL: CRITICAL: nf_conntrack is 92 % full [13:46:53] PROBLEM - Check size of conntrack table on mw1309 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [13:47:33] I have a SWAT question :) if I would like to backport a commit against both wmf.22 and wmf.23, would you say that counts as one or two patches? (against the “max 8 patches” limit) [13:48:16] Lucas_WMDE: 2 patches I'd say, but greg-g is the boss. [13:48:17] Yes [13:48:29] ok thanks [13:48:59] Any idea why 8 and not 10? [13:49:04] (PS1) Rush: labstore: monitoring changes for critical and contacts [puppet] - https://gerrit.wikimedia.org/r/415283 (https://phabricator.wikimedia.org/T178405) [13:49:32] (CR) jerkins-bot: [V: -1] labstore: monitoring changes for critical and contacts [puppet] - https://gerrit.wikimedia.org/r/415283 (https://phabricator.wikimedia.org/T178405) (owner: Rush) [13:49:54] (CR) Vgutierrez: [C: 1] "check comments, LGTM otherwise" (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/406478 (owner: Mark Bergsma) [13:50:08] Operations, ops-eqiad: rack/setup/install wdqs100[7-9] - https://phabricator.wikimedia.org/T188432#4010289 (Gehel) Note: wdqs1006 does not exist (and has never existed to my knowledge). We could name those servers wdqs100[6-8] instead of wdqs100[7-9].
[13:50:22] RECOVERY - Check size of conntrack table on mw1308 is OK: OK: nf_conntrack is 79 % full [13:52:52] RECOVERY - Check size of conntrack table on mw1309 is OK: OK: nf_conntrack is 79 % full [13:53:28] (03PS2) 10Ladsgroup: Reduce the batch size of statment usage tracking to 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415264 (https://phabricator.wikimedia.org/T151717) [13:56:32] PROBLEM - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 607.26 seconds [13:56:54] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010303 (10jcrespo) db1113 as db1114 are not spares, they were bought to generate backups on eqiad, we need them. [13:57:10] okay, I filled up the SWAT, I hope no one gets mad at me :) [13:57:20] (if anyone else needs a change feel free to kick my last additions out) [13:58:22] PROBLEM - Check size of conntrack table on mw1308 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [13:58:45] raynor: sorry, just saw your question, if nobody else insists, I will SWAT [13:58:53] PROBLEM - Check size of conntrack table on mw1309 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [13:59:18] ok, we have our popups task again [13:59:47] and we will need like 15-20 mins to test it. It would be awesome if we can go first, or for example get our code on second mwtesting [14:00:06] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T1400). [14:00:06] lokal-profil and raynor: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:12] o/ [14:00:14] I can SWAT today [14:01:18] and my second change, it's beta config change, and yesterday I learned it will go live automatically once merged, it doesn't have to go through SWAT, I just need someone with +2 rights [14:01:19] huh, jouncebot didn’t ping me… [14:01:27] oh, CI is busy :( [14:01:53] PROBLEM - Check size of conntrack table on mw1309 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [14:02:27] the usual questions :) lokal-profil, raynor, Lucas_WMDE: do you want to deploy your changes, if you can? [14:02:43] I cannot, so I’d be very grateful for your assistance again :) [14:02:47] Lucas_WMDE: If you've only just added it.... He needs refreshing [14:02:51] jouncebot: reload [14:02:54] jouncebot: refresh [14:02:57] I refreshed my knowledge about deployments. [14:02:59] * Reedy kicks jouncebot [14:03:08] Reedy: thanks, I was a bit late [14:03:16] I don't think I can deploy myself [14:04:00] raynor: you would like to go first, and 414751 to be deployed to mwdebug1002 first? [14:04:21] yes please [14:04:51] raynor: then I'll make it so :) [14:05:04] awesome, thank you [14:05:43] (03PS2) 10Zfilipin: Enable HTML Previews on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [14:05:55] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [14:06:02] RECOVERY - Check size of conntrack table on mw1309 is OK: OK: nf_conntrack is 77 % full [14:06:12] raynor: merging it, it might take some time since CI is busy [14:06:23] RECOVERY - Check size of conntrack table on mw1308 is OK: OK: nf_conntrack is 78 % full [14:06:55] (03PS2) 10Rush: rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 [14:07:09] raynor: since 414769 touches different file, I'll merge it too, can it be tested at mwdebug1002? 
[14:07:10] (03Merged) 10jenkins-bot: Enable HTML Previews on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [14:07:22] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 (owner: 10Rush) [14:07:28] or should I just deploy and you will test at beta cluster? [14:07:44] (03PS3) 10Rush: rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 [14:08:04] zeljkof, just deploy, I'll check it on betacluister [14:08:11] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 (owner: 10Rush) [14:08:14] (03CR) 10jenkins-bot: Enable HTML Previews on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [14:08:46] raynor: 414751 is at mwdebug1002, you said you need about 15 minutes to test? 
[14:08:50] the change is fairly trivial and will affect on beta cluster event logging (popups will send one event more) [14:08:53] (03PS2) 10Filippo Giunchedi: aptrepo: add puppetdb4 component [puppet] - 10https://gerrit.wikimedia.org/r/415244 (https://phabricator.wikimedia.org/T184562) [14:08:55] yes, 15 mins [14:08:57] on it [14:09:19] (03PS4) 10Rush: rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 [14:09:49] doh, rebase wars, I'll hold on chasemp [14:10:05] (03PS5) 10Zfilipin: beta: enable VirtualPagePreviews events on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T184793) (owner: 10Pmiazga) [14:10:09] godog: nah, I'm grabbing coffee anyhow, do your thing :) [14:10:17] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T184793) (owner: 10Pmiazga) [14:10:19] (03CR) 10Filippo Giunchedi: [C: 032] aptrepo: add puppetdb4 component [puppet] - 10https://gerrit.wikimedia.org/r/415244 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [14:10:23] chasemp: ahah ok! [14:10:32] PROBLEM - Check size of conntrack table on mw1308 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [14:11:15] ^ fixing [14:11:19] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010364 (10jcrespo) [14:11:21] zeljkof: was I too late to add my staff to SWAT last minute? 
[14:11:27] (03Merged) 10jenkins-bot: beta: enable VirtualPagePreviews events on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T184793) (owner: 10Pmiazga) [14:11:58] Pchelolo: you have to talk with Lucas_WMDE, swat is already full, some commits will not be deployed :) [14:12:43] (03CR) 10jenkins-bot: beta: enable VirtualPagePreviews events on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T184793) (owner: 10Pmiazga) [14:12:52] (03CR) 10Mark Bergsma: Add unit tests for Coordinator methods (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/406478 (owner: 10Mark Bergsma) [14:13:06] (03PS5) 10Mark Bergsma: Add unit tests for Coordinator methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/406478 [14:13:08] (03PS1) 10Mark Bergsma: Add a test case for removing previously existing servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/415294 [14:13:12] I can drop the “bump cache key” backports [14:13:18] or perhaps move them to Morning SWAT [14:13:47] Lucas_WMDE, Pchelolo: I can not guarantee that I will be able to deploy 8 patches in an hour :) [14:13:52] so plan accordingly [14:13:52] zeljkof: kk, in that case you can leave mine out then. I'll remove it from the calendar. 
[14:14:42] zeljkof: removed mine, we will handle it on SF time [14:15:11] Pchelolo: ok, sorry about that, but better to let you know now than later :) [14:15:15] (03PS2) 10Rush: labstore: monitoring changes for critical and contacts [puppet] - 10https://gerrit.wikimedia.org/r/415283 (https://phabricator.wikimedia.org/T178405) [14:15:16] Hauskatze: that would be why the limit isn’t higher :) [14:15:26] (03CR) 10Mark Bergsma: Add unit tests for Coordinator methods (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/406478 (owner: 10Mark Bergsma) [14:15:32] RECOVERY - Check size of conntrack table on mw1308 is OK: OK: nf_conntrack is 67 % full [14:15:43] (03CR) 10Mark Bergsma: [C: 032] Add unit tests for Coordinator methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/406478 (owner: 10Mark Bergsma) [14:16:11] (03Merged) 10jenkins-bot: Add unit tests for Coordinator methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/406478 (owner: 10Mark Bergsma) [14:16:38] Lucas_WMDE: yep, probably. 
Sometimes not even two if mighty CI doesn't want to work :) [14:17:13] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:414769|beta: enable VirtualPagePreviews events on beta cluster (T184793 T186728)]] (duration: 00m 57s) [14:17:19] Hauskatze, Lucas_WMDE: if there is no trouble, up to 8 patches might be doable, but any trouble means fewer patches [14:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:30] T184793: Instrument page interactions - https://phabricator.wikimedia.org/T184793 [14:17:30] T186728: Record and aggregate page previews - https://phabricator.wikimedia.org/T186728 [14:17:35] raynor: 414769 is deployed [14:18:00] (PS3) Rush: labstore: monitoring changes for critical and contacts [puppet] - https://gerrit.wikimedia.org/r/415283 (https://phabricator.wikimedia.org/T178405) [14:18:05] (PS9) Zfilipin: Drop the medlem user group and editallpages user right [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: Lokal Profil) [14:18:57] lokal-profil: your patch is next, but raynor's patch is time-consuming to test, so it might be another 5-10 minutes, please stand by [14:19:20] no worries, I'll get some coffee =) [14:20:32] zeljkof: just FYI, my backports are all in code that isn’t active yet (needs a config change), so I won’t really be able to test them… I’d just check that nothing on Grafana blows up when they’re deployed [14:20:54] (but if the config change is deployed without those commits, then it causes problems, hence the backports) [14:21:12] !log gehel@tin Started deploy [tilerator/deploy@455a31a]: adding Brighmed, Meddo and ClearTables to tilerator [14:21:12] Lucas_WMDE: ok, so your commits can be merged and deployed, there is no testing at mwdebug?
[14:21:22] yes, that’s what I meant [14:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:40] Lucas_WMDE: can I merge and deploy them all together? or should I do it one by one? [14:21:50] all together should be fine [14:22:05] though I confess I’m not sure what the deployment for the wmf.23 ones means anyways :) [14:22:05] Lucas_WMDE: ok, that will make it doable/faster [14:22:28] we need 5 more minutes [14:22:38] so far - looks good [14:22:56] !log draining restbase2005 for eventual reboot for kernel security update [14:23:02] (03CR) 10Zfilipin: [C: 031] Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: 10Lokal Profil) [14:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:17] raynor: thumbs up 👍 [14:23:44] btw, the betacluster config change - it' [14:23:48] it's good, thx [14:23:51] Lucas_WMDE: ah, there are a couple of 22 and 3 23, but all wikis are at 22 [14:23:58] raynor: great! [14:24:06] yeah, I saw that the train is still blocked [14:24:24] Lucas_WMDE: I'll deploy them in two groups then, 22 and 23 [14:24:28] ok thanks [14:24:34] if 23 is on tin at all :) [14:24:46] zeljkof: I was also about to ask: is train still blocked? [14:25:02] kart_: I really don't know :) I don't speak train at all ;) [14:25:22] there should be an e-mail or phab task about it, for sure, that is all I know [14:25:35] !log gehel@tin Finished deploy [tilerator/deploy@455a31a]: adding Brighmed, Meddo and ClearTables to tilerator (duration: 04m 27s) [14:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:34] !log rebooting kubestage* for kernel security update [14:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:05] zeljkof, success,. 
looks good [14:31:16] raynor: ok, deploying [14:31:23] can you push 414751 to production [14:31:27] thx [14:32:41] (03CR) 10Mark Bergsma: [C: 031] "Nice work." [debs/pybal] - 10https://gerrit.wikimedia.org/r/414711 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [14:32:45] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:414751|Enable HTML Previews on all wikipedias (T182319)]] (duration: 00m 57s) [14:32:56] hashar: should I deploy commits to wmf.23 since all wikis are at wmf.22? [14:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:03] T182319: Show HTML summaries on all wikis - https://phabricator.wikimedia.org/T182319 [14:33:15] raynor: deployed, please check and thanks for deploying with #releng! ;) [14:33:34] zeljkof: I guess yes [14:33:52] I have no idea which versions the train is rolling though [14:33:56] lokal-profil: you are next, please stand by, I'll let you know when the commit is at mwdebug1002 for testing, in a few minutes [14:34:20] hashar: that's my thinking, if I merge something while the deployment is still going on... [14:34:26] will I break stuff? :) [14:34:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: 10Lokal Profil) [14:34:51] zeljkof: that does make any sense :D [14:35:23] zeljkof: the currently deployed version ( wmf.22 ) has some issue that needs to be fixed [14:35:24] hashar: I don't know how train works, just thinking, you think it will be fine to deploy a few commits to 23? 
[14:35:34] wmf.23 has already been cut so the patch for wmf.22 also have to be applied to wmf.23 [14:35:46] hence why each are for both branch [14:35:50] hashar: there are a few commits for 23 https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T1400 [14:35:55] (03Merged) 10jenkins-bot: Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: 10Lokal Profil) [14:35:55] zeljkof is there any debugging that can be done, considering it only affects sewikimedia settings [14:36:16] lokal-profil: you can check sewiki at mwdebug1002? [14:36:19] zeljkof: so deploy the wmf.22 patches as usualy (testing on mwdebug etc) and if they all work you can then just +2 the wmf.23 patches [14:36:25] lokal-profil: just in case [14:36:37] (and deploy the wmf.23 patch on tin) [14:36:59] hashar: ok, so I just deploy 23 patches as normal, right? [14:37:15] hmm almost [14:37:23] there is no wiki to tests them ;] [14:37:43] so for the wmf.23 patches, +2 , pull on tin and scap :] [14:37:57] hashar: ok, will do [14:38:08] (03PS2) 10Vgutierrez: Provide testing for FSM.BGPTimer [debs/pybal] - 10https://gerrit.wikimedia.org/r/414711 (https://phabricator.wikimedia.org/T188085) [14:38:10] (03CR) 10jenkins-bot: Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: 10Lokal Profil) [14:38:32] lokal-profil: 404942 is at mwdebug1002, can you test there? [14:38:57] !log dropping sqldata on dbstore1001 [14:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:53] (03CR) 10Mark Bergsma: [C: 04-1] "Overall this is nice and actually an improvement towards making/keeping bgp.py independent of pybal again." 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [14:40:41] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/415047 (owner: 10Ema) [14:41:05] (03PS4) 10BBlack: Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430) [14:41:07] (03PS3) 10BBlack: reload-vcl refactors/improvements [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) [14:41:09] (03PS3) 10BBlack: Make inter-varnish probes great again [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) [14:41:42] (03CR) 10BBlack: [C: 031] "The new python is tested now, works!" [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [14:41:46] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010458 (10Papaul) Disks from the decommissioned servers [14:42:01] (03CR) 10Vgutierrez: [C: 032] Provide testing for FSM.BGPTimer [debs/pybal] - 10https://gerrit.wikimedia.org/r/414711 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [14:42:51] (03Merged) 10jenkins-bot: Provide testing for FSM.BGPTimer [debs/pybal] - 10https://gerrit.wikimedia.org/r/414711 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [14:42:58] works fine =) [14:43:27] (03CR) 10Mark Bergsma: "Yes, this approach works if you're careful but it's also tricky for the reasons mentioned." 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) (owner: 10Vgutierrez) [14:44:35] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4010461 (10fgiunchedi) rhodium with puppetdb-terminus from puppetdb 2.3 works as expected, the only initialization I had to do was to update `/sr... [14:45:07] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010462 (10jcrespo) Maybe we can try a disk we know it is in a good state to see if it is the disks or the controller/other disks, etc. CC @Marostegui ? [14:45:12] ACKNOWLEDGEMENT - puppet last run on lvs1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 29 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[ethtool_rss_combined_channels_eth0],Exec[ethtool_rss_combined_channels_eth1] Ema Decommissioning, we dont care about this. 
[14:46:48] lokal-profil: ok, will deploy in a minute, I'll let you know [14:47:10] Lucas_WMDE: deploying wmf.22 patches, please stand by [14:47:20] zeljkof: ack [14:47:26] !log zfilipin@tin Synchronized php-1.31.0-wmf.22/extensions/WikibaseQualityConstraints/: SWAT: [[gerrit:415285|Only filter statuses after collecting metadata (T188384)]] (duration: 01m 03s) [14:47:31] keeping an eye on server-board and mysql-aggregated, just in case [14:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:40] T188384: Only filter for result statuses after collecting metadata - https://phabricator.wikimedia.org/T188384 [14:47:46] Lucas_WMDE: deployed the first one ^ [14:48:35] !log zfilipin@tin Synchronized php-1.31.0-wmf.22/extensions/WikibaseQualityConstraints: SWAT: [[gerrit:415287|Don’t query WikiPageEntityMetaDataAccessor with empty list (T188311)]] (duration: 01m 02s) [14:48:38] Lucas_WMDE: ah, actually, deployed both, since I synced the entire extension the first time :) [14:48:47] hehe, okay :D [14:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:50] T188311: Don’t pass empty entity ID lists to WikiPageMetaDataAccessor - https://phabricator.wikimedia.org/T188311 [14:49:10] anyway, both wmf.22 patches are deployed now, please keep your eyes open [14:49:17] will do [14:50:21] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404942|Drop the medlem user group and editallpages user right (T184981)]] (duration: 00m 57s) [14:50:24] lokal-profil: your patch is deployed, please check production and thanks for deploying with #releng! 
:) [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:35] T184981: Remove the medlem user group and the editallpages user right on se.wikimedia.org - https://phabricator.wikimedia.org/T184981 [14:50:54] Lucas_WMDE: with 10 minutes left, I will review, merge and deploy wmf.23 patches and let you know when they are deployed [14:51:03] (03PS1) 10Filippo Giunchedi: hieradata: repool rhodium [puppet] - 10https://gerrit.wikimedia.org/r/415299 (https://phabricator.wikimedia.org/T184562) [14:51:04] okay, thank you [14:52:17] zeljkof: looks good in prod. Thanks [14:52:56] Lucas_WMDE: ah, there is one more wmf.22 patch, I have only noticed the first two, ok, deploying that one too [14:53:12] !log stopping labsdb1011 to clone it to labsdb1010 T186579 [14:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:28] T186579: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579 [14:54:16] !log rebooting ores in codfw for kernel security update [14:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:20] zeljkof, thanks for deployment [14:56:02] raynor, lokal-profil: no problemo, I am glad I could help! :) [14:56:05] (03PS8) 10Gehel: wdqs: icinga check for categories updates [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) [14:56:52] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010522 (10Marostegui) >>! In T188286#4010462, @jcrespo wrote: > Maybe we can try a disk we know it is in a good state to see if it is the disks or the controller/other disks, etc. CC @Marostegui ? Agreed.... [14:59:47] zeljkof: still in swat mode? 
[15:00:07] mobrovac: yes, a few more minutes, 5 or so [15:00:21] kk sure, no pb [15:00:23] waiting for CI mostly, I need just a minute or so to deploy [15:00:50] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010539 (10Papaul) on each decommissioned servers when the disk is blinking before decommissioning, the disk it label bad so I do not have to use it. [15:05:54] (03PS4) 10Rush: labstore: monitoring changes for critical and contacts [puppet] - 10https://gerrit.wikimedia.org/r/415283 (https://phabricator.wikimedia.org/T178405) [15:07:50] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010608 (10Marostegui) I don't want to believe we have such bad luck that all the disks we have used happened to be bad or become bad after a few days :( As I said above maybe it is safer to promote db2055... [15:08:32] Lucas_WMDE: deploying the last wmf.22 patch [15:08:36] ok [15:08:39] so far the servers seem fine [15:10:55] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010624 (10Papaul) I am good with that. If you want to try another disk. [15:11:41] !log zfilipin@tin Synchronized php-1.31.0-wmf.22/extensions/WikibaseQualityConstraints: SWAT: [[gerrit:415289|Bump cache key for check results (T188384)]] (duration: 01m 02s) [15:11:53] Lucas_WMDE: deployed ^ [15:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:54] T188384: Only filter for result statuses after collecting metadata - https://phabricator.wikimedia.org/T188384 [15:11:58] now deploying wmf.23 patches [15:12:06] ok, thank you! 
[15:13:09] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4010635 (10brion) @RobH we'd still like to buy 2 new machines with this configuration, so if/when the ones taken from the image scaler pool are needed elsewhere we've got... [15:15:05] (03PS2) 10Vgutierrez: Provide unique logging for BGP instances [debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) [15:15:12] (03CR) 10jerkins-bot: [V: 04-1] Provide unique logging for BGP instances [debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [15:15:29] (03CR) 10Rush: [C: 032] labstore: monitoring changes for critical and contacts [puppet] - 10https://gerrit.wikimedia.org/r/415283 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [15:15:37] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010638 (10Marostegui) Let's go for another disk then! Thanks guys! [15:15:55] !log zfilipin@tin Synchronized php-1.31.0-wmf.23/extensions/WikibaseQualityConstraints/: SWAT: [[gerrit:415288|Don’t query WikiPageEntityMetaDataAccessor with empty list (T188311)]] [[gerrit:415290|Bump cache key for check results (T188384)]] (duration: 01m 02s) [15:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:10] T188311: Don’t pass empty entity ID lists to WikiPageMetaDataAccessor - https://phabricator.wikimedia.org/T188311 [15:16:31] Lucas_WMDE all deployed! please check whatever needs to be checked and thanks for deploying with #releng! ;) [15:16:33] (03CR) 10Ottomata: "Ok, fine with me! 
:)" [puppet] - https://gerrit.wikimedia.org/r/413362 (https://phabricator.wikimedia.org/T114199) (owner: Elukey) [15:16:44] mobrovac: all done, sorry, CI took a bit longer than expected [15:16:52] !log EU SWAT finished [15:16:56] (CR) Vgutierrez: "@mark. Right, I got rid of the ugly multiple inheritance" [debs/pybal] - https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) (owner: Vgutierrez) [15:17:01] zeljkof: thank you! everything seems fine so far [15:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:44] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010651 (Marostegui) I know, just saying that as we ordered 8 servers already, we don't really wait to wait for those to arrive if we want to use th... [15:18:25] (PS1) Rush: openstack: labstore monitoring typo fix [puppet] - https://gerrit.wikimedia.org/r/415300 (https://phabricator.wikimedia.org/T178405) [15:18:58] (CR) Mark Bergsma: [C: 1] "Looks good, one comment on logging."
(1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) (owner: Vgutierrez) [15:19:04] (CR) Rush: [C: 2] openstack: labstore monitoring typo fix [puppet] - https://gerrit.wikimedia.org/r/415300 (https://phabricator.wikimedia.org/T178405) (owner: Rush) [15:20:02] (PS5) Ema: wmf-upgrade-and-reboot: upgrade the given host and reboot it [puppet] - https://gerrit.wikimedia.org/r/415047 [15:20:57] !log draining restbase2006 for eventual reboot for kernel security update [15:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:24] !log upgrade cache_text@eqiad to varnish 5 [15:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:44] !log rebooting ores in eqiad for kernel security update [15:22:52] Operations, ops-eqiad: rack/setup/install wdqs100[6-8] - https://phabricator.wikimedia.org/T188432#4010674 (Cmjohnson) [15:22:54] (PS9) Gehel: wdqs: icinga check for categories updates [puppet] - https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) [15:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:59] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4010675 (jcrespo) Oh, I didn't think about that- you are completely right.
[15:23:19] (03CR) 10Ema: [C: 032] wmf-upgrade-and-reboot: upgrade the given host and reboot it [puppet] - 10https://gerrit.wikimedia.org/r/415047 (owner: 10Ema) [15:23:35] (03CR) 10Ema: [C: 032] cache_text: upgrade eqiad to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/415275 (https://phabricator.wikimedia.org/T184448) (owner: 10Ema) [15:23:42] (03PS2) 10Ema: cache_text: upgrade eqiad to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/415275 (https://phabricator.wikimedia.org/T184448) [15:23:44] (03CR) 10Ema: [V: 032 C: 032] cache_text: upgrade eqiad to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/415275 (https://phabricator.wikimedia.org/T184448) (owner: 10Ema) [15:25:58] 10Operations, 10Goal, 10Patch-For-Review, 10User-Elukey, 10User-fgiunchedi: Stop using jmx_exporter deployed via scap in favour of Debian package - https://phabricator.wikimedia.org/T181728#4010693 (10fgiunchedi) >>! In T181728#4006555, @elukey wrote: > ETOOSOON, we'll need to cleanup the scap dirs on th... [15:26:09] 10Operations, 10Analytics, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4010694 (10Ottomata) > Anyhow, let's do kafkatee/varnishkafka/jmxtrans for the moment. Would it be ok? Ya, let's do these. > Half of the watchers are Wikimed... [15:28:59] (03PS1) 10Dzahn: add transparency-private.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/415302 (https://phabricator.wikimedia.org/T188362) [15:32:51] 10Operations, 10Analytics, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4010715 (10elukey) >>! In T188377#4010694, @Ottomata wrote: >> Anyhow, let's do kafkatee/varnishkafka/jmxtrans for the moment. Would it be ok? > Ya, let's do... 
[15:33:22] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4010716 (10Dzahn) Access is still denied to me on T175445 which is surprising because i'm a member of WMF-NDA and can read securi... [15:34:13] (03CR) 10Volans: [C: 04-1] "Nice! Couple of really minor things." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) (owner: 10Gehel) [15:35:43] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [debs/pybal] - 10https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) (owner: 10Vgutierrez) [15:36:35] 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3989817 (10Dzahn) ``` modules/profile/manifests/phabricator/main.pp: srange => '$CACHE_MISC', modules/profile/manifests/ci/firewall.pp: srange => '$CACHE_MISC', modules/profile/manifes... 
[15:36:39] (03PS2) 10Filippo Giunchedi: hieradata: repool rhodium [puppet] - 10https://gerrit.wikimedia.org/r/415299 (https://phabricator.wikimedia.org/T184562) [15:38:43] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:39:47] (03CR) 10Elukey: deployment-prep: set profile::cache::kafka::webrequest::kafka_cluster_name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/414738 (https://phabricator.wikimedia.org/T188288) (owner: 10Dzahn) [15:43:28] (03PS1) 10Dzahn: deployment-prep: fix kafka cluster name [puppet] - 10https://gerrit.wikimedia.org/r/415307 [15:44:41] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: repool rhodium [puppet] - 10https://gerrit.wikimedia.org/r/415299 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:45:49] !log repool rhodium as puppet master backend [15:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:25] (03CR) 10Gehel: wdqs: icinga check for categories updates (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) (owner: 10Gehel) [15:47:27] (03CR) 10Ema: "Looks great! A couple of comments and a pcc fail: https://puppet-compiler.wmflabs.org/compiler02/10178/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [15:47:58] (03PS10) 10Gehel: wdqs: icinga check for categories updates [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) [15:48:25] (03CR) 10jerkins-bot: [V: 04-1] wdqs: icinga check for categories updates [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) (owner: 10Gehel) [15:48:40] (03PS1) 10Ottomata: The rest of kafka stuff is configured in Horizon hiera interface. Let's set this properly there...doing... 
:) [puppet] - https://gerrit.wikimedia.org/r/415308 [15:48:46] (CR) Ottomata: [V: 2 C: 2] The rest of kafka stuff is configured in Horizon hiera interface. Let's set this properly there...doing... :) [puppet] - https://gerrit.wikimedia.org/r/415308 (owner: Ottomata) [15:48:51] (PS2) Ottomata: The rest of kafka stuff is configured in Horizon hiera interface. Let's set this properly there...doing... :) [puppet] - https://gerrit.wikimedia.org/r/415308 [15:48:53] (CR) Ottomata: [V: 2 C: 2] The rest of kafka stuff is configured in Horizon hiera interface. Let's set this properly there...doing... :) [puppet] - https://gerrit.wikimedia.org/r/415308 (owner: Ottomata) [15:50:03] (PS11) Gehel: wdqs: icinga check for categories updates [puppet] - https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) [15:51:50] (Abandoned) Ema: reload-vcl: discard old VCL after switching to the new one [puppet] - https://gerrit.wikimedia.org/r/412737 (owner: Ema) [15:52:31] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:59] Operations, ops-eqiad: Failed power supply redundancy on wdqs2006 - https://phabricator.wikimedia.org/T188501#4010802 (Papaul) Open>Resolved Power supply was not seal. Fixed [15:56:03] (Abandoned) Dzahn: deployment-prep: fix kafka cluster name [puppet] - https://gerrit.wikimedia.org/r/415307 (owner: Dzahn) [15:56:50] RECOVERY - IPMI Sensor Status on wdqs2006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:57:07] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) (owner: Gehel) [15:57:30] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:59:22] 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#4010818 (10Andrew) Yes, $CACHE_MISC includes the varnishes, but not lvs. See https://gerrit.wikimedia.org/r/#/c/413194/ You will see that before that patch, I was already using CACHE_MISC. Before... [16:00:22] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010823 (10Papaul) a:05Papaul>03Marostegui Disk replaced [16:02:57] !log draining restbase2007 for eventual reboot for kernel security update [16:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:33] (03PS1) 10Ottomata: Fix for eventlogging cache kafka in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/415315 [16:03:59] (03CR) 10jerkins-bot: [V: 04-1] Fix for eventlogging cache kafka in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/415315 (owner: 10Ottomata) [16:04:11] (03PS1) 10Filippo Giunchedi: Reinstall puppetmaster1002 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/415316 (https://phabricator.wikimedia.org/T184562) [16:04:23] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415317 [16:04:27] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415317 [16:05:07] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4010832 (10jcrespo) ``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) ``` [16:05:13] (03PS2) 10Ottomata: Fix for eventlogging cache kafka in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/415315 [16:05:36] (03CR) 10jerkins-bot: [V: 04-1] Fix for eventlogging cache kafka in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/415315 (owner: 10Ottomata) [16:06:19] 10Operations: Define a 
special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#4010833 (10Dzahn) Gotcha! Yea, we could add a new one for _just_ LVS hosts and then combine them similar to: ``` $deployable_networks = [ $mw_appserver_networks, $analytics_ne... [16:06:27] (03PS3) 10Ottomata: Fix for eventlogging cache kafka in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/415315 [16:06:38] ACKNOWLEDGEMENT - HP RAID on db2048 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK Jcrespo rebuilding after failure T188286 [16:07:12] (03PS4) 10Mobrovac: [JobQueue] Switch refreshLinks for all but wikipedia and wiktionary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414760 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [16:08:23] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:09:21] (03CR) 10Mobrovac: [C: 032] [JobQueue] Switch refreshLinks for all but wikipedia and wiktionary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414760 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [16:09:27] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10179/cp1052.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/415315 (owner: 10Ottomata) [16:09:29] !log ppchelko@tin Started deploy [cpjobqueue/deploy@3622e38]: Enable refreshLinks for all but wikipedia, wiktionary and commons [16:09:31] (03CR) 10Ottomata: [C: 032] Fix for eventlogging cache kafka in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/415315 (owner: 10Ottomata) [16:09:35] 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3989817 (10BBlack) Some of this conversation is confusing! 
:) I take it the situation is it's a public service behind cache misc, and our internal LVS ranges are used *behind* cache_misc to route t... [16:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:10] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@3622e38]: Enable refreshLinks for all but wikipedia, wiktionary and commons (duration: 00m 41s) [16:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:29] !log rebooting prometheus servers in codfw for kernel security update [16:10:33] (03Merged) 10jenkins-bot: [JobQueue] Switch refreshLinks for all but wikipedia and wiktionary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414760 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [16:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:46] (03CR) 10jenkins-bot: [JobQueue] Switch refreshLinks for all but wikipedia and wiktionary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414760 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [16:12:51] 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#4010855 (10BBlack) (to clarify: that we haven't needed an LVS-monitoring-ips ferm rule before implies that all the other services behind internal LVS (including those behind cache_misc) just have op... 
[16:14:58] (03PS5) 10Rush: rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 (https://phabricator.wikimedia.org/T188266) [16:15:23] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:16:37] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch refreshLinks for all but wikipedia, wiktionary and commons - T185052 (duration: 00m 57s) [16:17:19] (03PS6) 10Rush: rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 (https://phabricator.wikimedia.org/T188266) [16:17:20] PROBLEM - LVS HTTP IPv4 on search.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.167 second response time [16:17:43] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:17:44] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415317 [16:17:45] ^ reboot in progress. master reelection is taking longer than expected, checking [16:17:51] page, gehel need help? [16:17:52] ok [16:17:57] ack [16:18:05] no, it already fixed itself [16:18:08] k [16:18:29] RECOVERY - LVS HTTP IPv4 on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.169 second response time [16:18:49] unrelated question, what is the current state of search cluster? active-active, active-passive? [16:18:53] looks like master re-election took a bit longer than usual, not sure why... [16:19:21] between dcs, I mean [16:19:36] jynus: technically active-active, but no one is consuming the one in codfw... 
[16:19:48] (03CR) 10Rush: [C: 032] rabbitmq: move management to plugins manifest [puppet] - 10https://gerrit.wikimedia.org/r/415279 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:19:49] ah, because mediawiki is passive [16:19:50] does that count as active-active? [16:19:55] I get you [16:19:56] jynus: yep [16:20:18] <_joe_> jynus: https://config-master.wikimedia.org/discovery/services.yaml has that info [16:20:20] but as soon as mediawiki is active in codfw, elasticsearch is ready [16:20:23] <_joe_> although the format is not great [16:21:09] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415317 (owner: 10Marostegui) [16:22:08] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415317 (owner: 10Marostegui) [16:22:24] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415317 (owner: 10Marostegui) [16:23:09] (03PS2) 10Filippo Giunchedi: Reinstall puppetmaster1002 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/415316 (https://phabricator.wikimedia.org/T184562) [16:23:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 after alter table (duration: 00m 56s) [16:25:36] (03CR) 10Filippo Giunchedi: [C: 032] Reinstall puppetmaster1002 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/415316 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [16:27:04] (03PS1) 10Rush: openstack: include rabbitmq::plugin for management plugin [puppet] - 10https://gerrit.wikimedia.org/r/415322 (https://phabricator.wikimedia.org/T188266) [16:27:34] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:37] (03CR) 10jerkins-bot: [V: 04-1] openstack: include rabbitmq::plugin for management plugin [puppet] - 10https://gerrit.wikimedia.org/r/415322 
(https://phabricator.wikimedia.org/T188266) (owner: Rush) [16:28:36] (PS2) Rush: openstack: include rabbitmq::plugin for management plugin [puppet] - https://gerrit.wikimedia.org/r/415322 (https://phabricator.wikimedia.org/T188266) [16:28:48] (PS3) Rush: openstack: include rabbitmq::plugin for management plugin [puppet] - https://gerrit.wikimedia.org/r/415322 (https://phabricator.wikimedia.org/T188266) [16:29:43] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:30:03] (CR) Rush: [C: 2] openstack: include rabbitmq::plugin for management plugin [puppet] - https://gerrit.wikimedia.org/r/415322 (https://phabricator.wikimedia.org/T188266) (owner: Rush) [16:32:33] PROBLEM - puppet last run on ganeti1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:32:44] RECOVERY - DPKG on rhodium is OK: All packages OK [16:35:26] (CR) Mark Bergsma: [C: -1] "This is arguably worse, it's longer and uglier when actually logging (which is where it counts the most :)."
[debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [16:35:42] (03PS1) 10Filippo Giunchedi: puppetmaster: naggen2 depends on python-requests [puppet] - 10https://gerrit.wikimedia.org/r/415327 (https://phabricator.wikimedia.org/T184562) [16:36:05] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: naggen2 depends on python-requests [puppet] - 10https://gerrit.wikimedia.org/r/415327 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [16:36:50] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 10207.75 seconds Jcrespo backups running, delaying replication [16:39:42] (03PS2) 10Filippo Giunchedi: puppetmaster: naggen2 depends on python-requests [puppet] - 10https://gerrit.wikimedia.org/r/415327 (https://phabricator.wikimedia.org/T184562) [16:40:54] (03PS1) 10Muehlenhoff: Add conftool::scripts to Prometheus servers [puppet] - 10https://gerrit.wikimedia.org/r/415328 [16:41:03] (03PS1) 10Rush: openstack: glance for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/415329 (https://phabricator.wikimedia.org/T188266) [16:41:07] I uploaded some files with chunked file upload which failed. The message is I can resume upload process. [16:41:24] (03CR) 10Smalyshev: [C: 031] wdqs: icinga check for categories updates [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) (owner: 10Gehel) [16:41:26] (03CR) 10jerkins-bot: [V: 04-1] Add conftool::scripts to Prometheus servers [puppet] - 10https://gerrit.wikimedia.org/r/415328 (owner: 10Muehlenhoff) [16:41:54] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:42:25] (03CR) 10Rush: [C: 032] openstack: glance for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/415329 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:43:11] I'm taking a look at the ganeti puppet failures btw [16:44:57] !log draining restbase2008 for eventual reboot for kernel security update [16:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:10] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4010948 (10elukey) [16:47:33] PROBLEM - puppet last run on ganeti1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:48:21] Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Err [16:48:25] or while evaluating a Function Call, Could not find data item profile::ganeti::ganeti01.svc.eqiad.wmnet::nodes in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/ganeti.pp:4:21 [16:48:39] (03PS4) 10Vgutierrez: Provide an UDP monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) [16:48:43] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:14] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:51:53] (03CR) 10Ema: [C: 031] Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [16:52:34] (03PS5) 10Vgutierrez: Provide an UDP monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) [16:57:13] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [16:59:09] ^ icinga -v shows "Things look okay - No serious problems were detected" [16:59:32] uughhh yeah that's the stretch puppet master [16:59:45] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:23] (03CR) 10Ema: [C: 031] Make inter-varnish probes great again [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [17:01:39] (03PS3) 10Vgutierrez: Provide unique logging for BGP instances [debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) [17:02:20] (03CR) 10jerkins-bot: [V: 04-1] Provide unique logging for BGP instances [debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [17:06:03] (03PS4) 10Vgutierrez: Provide unique logging for BGP instances [debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) [17:07:15] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [17:08:28] (03CR) 10Ema: [C: 04-1] reload-vcl refactors/improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [17:09:04] (03CR) 10Vgutierrez: Provide an UDP monitor (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/413211 
(https://phabricator.wikimedia.org/T178151) (owner: 10Vgutierrez) [17:09:47] (03PS1) 10Filippo Giunchedi: puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) [17:10:14] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:10:33] (03PS2) 10Filippo Giunchedi: puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) [17:11:03] (03PS3) 10Filippo Giunchedi: puppetmaster: naggen2 depends on python-requests [puppet] - 10https://gerrit.wikimedia.org/r/415327 (https://phabricator.wikimedia.org/T184562) [17:11:05] (03PS3) 10Filippo Giunchedi: puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) [17:11:10] ^ the fix for einsteinium fails [17:11:16] (03PS1) 10Marostegui: tendril.my.cnf.ebr: Enable/disable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/415336 [17:12:01] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:12:05] (03CR) 10jerkins-bot: [V: 04-1] tendril.my.cnf.ebr: Enable/disable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/415336 (owner: 10Marostegui) [17:12:57] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: naggen2 depends on python-requests [puppet] - 10https://gerrit.wikimedia.org/r/415327 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:13:10] (03PS2) 10Marostegui: tendril.my.cnf.ebr: Enable/disable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/415336 [17:13:45] RECOVERY - puppet last run 
on mw1286 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:14:19] (03CR) 10BBlack: [C: 031] reload-vcl refactors/improvements (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [17:14:21] (03PS4) 10Filippo Giunchedi: puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) [17:14:42] (03PS4) 10BBlack: reload-vcl refactors/improvements [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) [17:14:44] (03PS4) 10BBlack: Make inter-varnish probes great again [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) [17:14:46] (03CR) 10Vgutierrez: [C: 031] "looks good :D just a tiny comment" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/415294 (owner: 10Mark Bergsma) [17:15:26] (03CR) 10Muehlenhoff: puppetmaster: naggen2 depends on python-requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415327 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:16:51] (03PS3) 10Marostegui: tendril.my.cnf.ebr: Enable/disable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/415336 (https://phabricator.wikimedia.org/T184704) [17:17:44] (03PS4) 10Marostegui: tendril.my.cnf.ebr: Enable/disable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/415336 (https://phabricator.wikimedia.org/T184704) [17:19:26] (03CR) 10Marostegui: "Puppet compiler says it does what it supposed to do: https://puppet-compiler.wmflabs.org/compiler03/10186/" [puppet] - 10https://gerrit.wikimedia.org/r/415336 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [17:19:45] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:21:48] (03Abandoned) 10Vgutierrez: Make flake8 happy [debs/pybal] - 10https://gerrit.wikimedia.org/r/413698 (owner: 10Vgutierrez) [17:22:16] (03Abandoned) 10Vgutierrez: Report coverage stats. Configure flake8 properly. [debs/pybal] - 10https://gerrit.wikimedia.org/r/413697 (owner: 10Vgutierrez) [17:22:35] (03CR) 10Ema: [C: 031] reload-vcl refactors/improvements [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [17:24:03] (03CR) 10Muehlenhoff: [C: 031] puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:24:08] (03CR) 10Filippo Giunchedi: puppetmaster: naggen2 depends on python-requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415327 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:24:20] (03PS2) 10Mark Bergsma: Add a test case for removing previously existing servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/415294 [17:24:49] (03CR) 10Jcrespo: [C: 031] tendril.my.cnf.ebr: Enable/disable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/415336 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [17:25:05] (03PS5) 10Marostegui: tendril.my.cnf.ebr: Enable/disable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/415336 (https://phabricator.wikimedia.org/T184704) [17:25:15] (03PS4) 10Dzahn: icinga: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409204 [17:25:43] (03CR) 10Marostegui: [C: 032] tendril.my.cnf.ebr: Enable/disable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/415336 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [17:26:14] (03PS20) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) 
[17:26:45] (03CR) 10Mark Bergsma: [C: 032] Add a test case for removing previously existing servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/415294 (owner: 10Mark Bergsma) [17:27:02] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4011098 (10Marostegui) [17:27:04] (03CR) 10jerkins-bot: [V: 04-1] labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:27:22] (03Merged) 10jenkins-bot: Add a test case for removing previously existing servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/415294 (owner: 10Mark Bergsma) [17:29:35] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:31:10] godog: I don't see the problem on ganeti, the key exists in hiera [17:31:25] is this from a new deploy? 
[17:31:33] (03PS21) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [17:32:12] (03CR) 10jerkins-bot: [V: 04-1] labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:32:36] (03PS1) 10Filippo Giunchedi: puppetmaster: clarify why python-mysqldb is present [puppet] - 10https://gerrit.wikimedia.org/r/415340 [17:32:38] (03PS1) 10Filippo Giunchedi: hieradata: depool rhodium, bring back puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/415341 (https://phabricator.wikimedia.org/T184562) [17:32:52] jynus: ^ yeah puppet master on stretch, I think is the culprit [17:33:21] (03CR) 10Muehlenhoff: [C: 031] puppetmaster: clarify why python-mysqldb is present [puppet] - 10https://gerrit.wikimedia.org/r/415340 (owner: 10Filippo Giunchedi) [17:33:21] oh, you think there was a difference in key parsing? [17:33:54] jynus: sth like that yeah, different hiera version [17:33:55] is there a way to force puppet agent to use a specific puppetmaster? [17:34:55] moritzm: glad you asked! 
yes there is, see commit of https://gerrit.wikimedia.org/r/c/415299/ [17:35:30] (03PS22) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [17:35:37] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:35:44] (03PS5) 10Filippo Giunchedi: puppetmaster: capture warnings in logging for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/415335 (https://phabricator.wikimedia.org/T184562) [17:36:06] (03CR) 10jerkins-bot: [V: 04-1] labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:36:40] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: clarify why python-mysqldb is present [puppet] - 10https://gerrit.wikimedia.org/r/415340 (owner: 10Filippo Giunchedi) [17:36:47] (03PS2) 10Filippo Giunchedi: puppetmaster: clarify why python-mysqldb is present [puppet] - 10https://gerrit.wikimedia.org/r/415340 [17:37:00] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: depool rhodium, bring back puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/415341 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:37:04] godog: thanks! 
[17:37:08] (03PS2) 10Filippo Giunchedi: hieradata: depool rhodium, bring back puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/415341 (https://phabricator.wikimedia.org/T184562) [17:38:02] !log phab2001 - downtimed, rebooting for kernel upgrade [17:38:13] (03PS3) 10Filippo Giunchedi: hieradata: depool rhodium, bring back puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/415341 (https://phabricator.wikimedia.org/T184562) [17:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:19] (03PS23) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [17:38:51] (03CR) 10jerkins-bot: [V: 04-1] labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:38:56] (03PS12) 10Gehel: wdqs: icinga check for categories updates [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) [17:39:05] PROBLEM - Host phab2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:39:38] (03PS13) 10Gehel: wdqs: icinga check for categories updates [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) [17:39:52] !log milimetric@tin Started deploy [analytics/refinery@e551744]: Update sqoop job and orm artifact [17:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:25] RECOVERY - Host phab2001 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [17:40:48] (03PS24) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [17:41:03] (03CR) 10Gehel: [C: 032] wdqs: icinga check for categories updates [puppet] - 10https://gerrit.wikimedia.org/r/415010 (https://phabricator.wikimedia.org/T188293) (owner: 10Gehel) [17:41:22] (03CR) 10jerkins-bot: [V: 04-1] labweb: 
include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:41:28] (03PS1) 10Urbanecm: Clean obsolete throttle requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415344 [17:43:13] that was a nice snippet re: how to run on a test puppetmaster. i added that to wiki https://wikitech.wikimedia.org/wiki/Puppet#force_puppet_agent_to_use_a_specific_puppetmaster [17:44:02] (03PS1) 10Gehel: wdqs: fix path to new nagios check [puppet] - 10https://gerrit.wikimedia.org/r/415345 (https://phabricator.wikimedia.org/T188293) [17:44:15] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415346 [17:44:32] (03CR) 10Gehel: [C: 032] wdqs: fix path to new nagios check [puppet] - 10https://gerrit.wikimedia.org/r/415345 (https://phabricator.wikimedia.org/T188293) (owner: 10Gehel) [17:44:38] (03PS2) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415346 (https://phabricator.wikimedia.org/T188529) [17:45:38] mutante: nice! yeah that still require that the frontend is configured to send requests to rhodium (the "test" puppetmaster) though [17:45:45] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:05] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:06] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:12] !log Finished running CLL preference migration script on terbium (T187677) [17:46:15] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:46:15] PROBLEM - Check size of conntrack table on mw1310 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [17:46:16] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:16] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 75 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [17:46:16] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:25] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:27] T187677: Deploy Compact Language Links on the English Wikipedia - https://phabricator.wikimedia.org/T187677 [17:46:36] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:36] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:37] !log milimetric@tin Finished deploy [analytics/refinery@e551744]: Update sqoop job and orm artifact (duration: 06m 45s) [17:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:16] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:16] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:16] PROBLEM - puppet last run on wdqs2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_wdqs_categories.py] [17:47:26] PROBLEM - puppet last run on mw1344 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:26] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:26] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:35] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:35] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:39] <_joe_> uhm [17:47:45] <_joe_> nitrogen or what? [17:47:45] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:46] PROBLEM - Check size of conntrack table on mw1311 is CRITICAL: CRITICAL: nf_conntrack is 93 % full [17:47:56] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:56] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:06] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:06] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:48:21] <_joe_> yes, nitrogen [17:48:25] (03PS25) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [17:48:35] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:45] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:59] <_joe_> uhm interestingly, puppetdb is still up [17:49:00] (03CR) 10jerkins-bot: [V: 04-1] labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:49:28] yeah I was about to say that, puppetdb shows the last restart ~17h ago [17:49:36] <_joe_> also seems to be working [17:49:38] <_joe_> uhm [17:49:45] PROBLEM - puppet last run on db1081 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:45] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:45] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:45] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:46] let me retry on one [17:49:55] I just repooled puppetmaster1002 but that should be hitless [17:49:56] <_joe_> we already had a case like this [17:50:11] you think routing? [17:50:17] <_joe_> no [17:50:25] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:50:36] I'm getting "execution expired" from nitrogen.eqiad.wmnet [17:50:45] PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:48] PROBLEM - puppet last run on dns4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:51] <_joe_> can someone check the grafana data for puppetdb? [17:50:56] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:56] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:56] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:56] PROBLEM - puppet last run on mw1333 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:56] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:01] yeah, puppet run gets stalled [17:51:05] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:05] PROBLEM - puppet last run on mw1315 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:05] PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:06] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:06] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:51:06] PROBLEM - puppet last run on mw1342 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:15] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:15] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:15] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:15] PROBLEM - puppet last run on mw1321 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:25] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [17:51:26] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:26] PROBLEM - puppet last run on mw1329 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:35] PROBLEM - puppet last run on mw1328 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:36] PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:36] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:36] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:46] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:51:59] <_joe_> https://grafana.wikimedia.org/dashboard/db/puppetdb?orgId=1 is strange [17:52:01] _joe_: activity seems down [17:52:05] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:05] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:05] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:05] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:05] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:05] PROBLEM - puppet last run on druid1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:05] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:06] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:06] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:07] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:08] <_joe_> yes [17:52:15] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:15] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:52:15] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:15] PROBLEM - puppet last run on ores1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:16] PROBLEM - puppet last run on ms-be1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:16] PROBLEM - puppet last run on mw1330 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:16] PROBLEM - puppet last run on cp4032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:26] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:26] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:26] some stall? [17:52:32] <_joe_> can someone kill icinga-wm please? [17:52:35] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:35] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:36] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:36] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:36] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:36] PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:52:36] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:37] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:37] PROBLEM - puppet last run on db1098 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:38] PROBLEM - puppet last run on mw1338 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:38] PROBLEM - puppet last run on wtp1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:39] PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:39] PROBLEM - puppet last run on wtp1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:40] PROBLEM - puppet last run on logstash1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:49] I will [17:52:56] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:56] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:53:07] <_joe_> I would say the problem is not with puppetdb [17:53:16] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:53:42] it is going through, very slowly [17:53:49] (03PS26) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [17:53:54] ah no, execution expired [17:54:04] <_joe_> godog: can we depool puppetmaster1002? [17:54:10] <_joe_> it is indeed receiving traffic [17:54:20] (03CR) 10jerkins-bot: [V: 04-1] labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:54:30] _joe_: sure, we'd be left with just puppetmaster1001 but I think it has enough capacity? [17:54:37] <_joe_> godog: ouch [17:54:39] <_joe_> nop [17:54:41] <_joe_> nope [17:54:50] why not? [17:54:51] <_joe_> but look at syslog on puppetmaster1002 [17:54:51] network connectivity seems fine [17:55:09] (03PS2) 10Cmjohnson: updating netboot.cfg for analytics1070-77 [puppet] - 10https://gerrit.wikimedia.org/r/415184 (https://phabricator.wikimedia.org/T188294) [17:55:12] <_joe_> godog: let's move everyone to use the codfw infrastructure [17:55:49] <_joe_> have you noticed all the errors are from eqiad? 
[17:55:58] <_joe_> so it's not puppetdb [17:56:19] <_joe_> seems like puppetmaster1002 fails to submit facts to puppetdb [17:56:21] <_joe_> wtf [17:56:43] (03PS27) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [17:56:48] I'll restart apache/passenger on puppetmaster1002 first [17:56:53] <_joe_> yes [17:56:55] <_joe_> good idea [17:57:00] <_joe_> I was about to suggest we do that [17:57:14] (03CR) 10jerkins-bot: [V: 04-1] labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:57:16] <_joe_> just to be clear: puppetmaster1001 is working fine [17:57:59] 10Operations, 10Datasets-General-or-Unknown, 10User-ArielGlenn: Reboots of dumps/snapshot hosts - https://phabricator.wikimedia.org/T188242#4011193 (10ArielGlenn) [17:58:04] this is odd... [17:58:07] (03PS5) 10Vgutierrez: Provide unique logging for BGP instances [debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) [17:58:09] <_joe_> godog: why is rhodium still depooled? [17:58:18] (03CR) 10Cmjohnson: [C: 032] updating netboot.cfg for analytics1070-77 [puppet] - 10https://gerrit.wikimedia.org/r/415184 (https://phabricator.wikimedia.org/T188294) (owner: 10Cmjohnson) [17:58:32] curl https://nitrogen.eqiad.wmnet was timing out from puppetmaster1002, now it works [17:58:34] <_joe_> godog: also, did you restart apache on 1002? if so log it [17:58:41] _joe_: because it wasn't able to compile all catalogs [17:58:43] <_joe_> herron: wat? [17:58:48] !log restart apache2 on puppetmaster1002 [17:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:25] <_joe_> ok next time this happens, let's look at network instead [17:59:31] puppetdb connection was timing out [17:59:35] what is currently broken, functionality-wise? 
[17:59:45] <_joe_> ema: puppet on some of eqiad [17:59:46] (03PS28) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [17:59:54] and puppetmaster1002 puppet-master[25859]: Server Error: Failed to submit 'replace facts' command for db1055.eqiad.wmnet to PuppetDB at nitrogen.eqiad.wmnet:443: execution expired [17:59:57] <_joe_> like 66% of runs will fail [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T1800). Please do the needful. [18:00:07] Amir1, kart_, Lucas_WMDE, and Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:07] <_joe_> herron: that's a ruby error [18:00:12] o/ [18:00:16] <_joe_> that's a different thing than what you said [18:00:16] _joe_: ok so I'll stop text@eqiad upgrades for now [18:00:19] * kart_ is here [18:00:19] none of mine is testable :D [18:00:25] <_joe_> you said a curl wouldn't work [18:00:26] (03CR) 10jerkins-bot: [V: 04-1] labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:00:30] I’m here [18:00:40] puppetmaster1002 seems doing better [18:00:41] (and my change isn’t testable either) [18:00:51] <_joe_> godog: it's still not repooled by apache [18:01:13] <_joe_> the apache on puppetmaster1001, I mean [18:01:14] * kart_ 's change is highly visible :) [18:01:25] curl wouldnt work and execution expired errors [18:02:06] today kart_ and I are going to deploy a thing [18:02:09] <_joe_> ok, if it happens again, let's check netstat too [18:02:19] I waited for this thing eight 
years [18:02:35] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/10188/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/409204 (owner: 10Dzahn) [18:02:59] <_joe_> I would expect apache to already have repooled puppetmaster1002, how did we configure proxy parameters? [18:03:09] <_joe_> it's already 3 minutes since your restart [18:03:28] Question is who is SWAT'ng :) zeljkof? [18:03:37] thcipriani: btw, re train, looks like this https://gerrit.wikimedia.org/r/#/c/415223/ needs to be backported to wmf.23, done at https://gerrit.wikimedia.org/r/#/c/415351/ (well, someone needs to +2) [18:03:50] kart_: not me :) [18:04:18] _joe_: indeed should be repooled already by now, I'll take a look [18:04:24] legoktm: just to confirm, all that needs to happen now with the deletion log is backporting to wmf.23 and rolling forward right? any need to backport to wmf.22? [18:04:32] greg-g: sure, I noticed ping last night + movement this morning. [18:04:48] I think its just in .23 [18:04:48] thcipriani: ack, just trying to get ducks in their rows :) [18:04:58] :) [18:05:14] legoktm: will there be follow-up to fix the wrong db entries? [18:05:18] thcipriani: SWAT? [18:05:31] I don't know if that's possible [18:05:36] legoktm: kk :( [18:05:38] We can probably guess [18:05:52] But that would just be manual work [18:05:55] <_joe_> ping=1 connectiontimeout=1 retry=500 timeout=900 [18:06:02] !log stop and restart apache2 on puppetmaster1002 [18:06:03] legoktm: right right, makes sense. 
:/ [18:06:06] <_joe_> so it's 4 minutes [18:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:18] kart_: I have a meeting in 10 minutes so I can't :( [18:06:35] (PS6) Vgutierrez: Provide unique logging for BGP instances [debs/pybal] - https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) [18:07:28] _joe_: I'm tempted to graceful apache2 on puppetmaster1001 instead of waiting, iirc that should poke mod_proxy too [18:07:36] <_joe_> godog: we restarted apache just when it got it straffic back, it seems [18:07:53] (PS2) Milimetric: Update the cron command with the new sqoop script [puppet] - https://gerrit.wikimedia.org/r/415217 (https://phabricator.wikimedia.org/T184759) [18:07:56] I can SWAT. [18:07:58] <_joe_> at 2018-02-28T18:05:51 [18:08:08] <_joe_> anyways, let's wait [18:08:15] <_joe_> puppet should be recovering, right? [18:08:29] <_joe_> godog: we should really re-add rhodium though if it's ok [18:08:33] (PS5) Niharika29: Deploy Compact Language Links out of Beta on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/412696 (https://phabricator.wikimedia.org/T187677) (owner: KartikMistry) [18:08:39] aude MaxSem Niharika - SWAT anyone? [18:08:54] kart_: See my last message. :) [18:08:57] _joe_: heh I don't think it is ok yet no [18:09:02] kart_: Doing your change first. [18:09:13] oh. wow. This will be special, right Niharika? :) [18:09:17] <_joe_> godog: why? [18:09:20] No-one could be more appropriate that Niharika :) [18:09:25] Indeed! :D [18:09:43] (CR) Niharika29: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/412696 (https://phabricator.wikimedia.org/T187677) (owner: KartikMistry) [18:09:50] !log rebooting dataset1001 (dumps.wm.o) for new kernel [18:09:58] _joe_: hiera version changed in stretch, not all catalogs are compiling successfully (e.g.
ganeti) [18:10:00] <_joe_> oh still issues with stretch I guess [18:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:08] <_joe_> godog: oh, that's interesting [18:10:17] <_joe_> so we're getting a transition to hiera 3? [18:10:40] (CR) Vgutierrez: "After discussing it with @mark we decided to go for the KISS approach. Let's refactor this in the future if it becomes hard to maintain." [debs/pybal] - https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) (owner: Vgutierrez) [18:10:52] Lucas_WMDE: Tests are failing on https://gerrit.wikimedia.org/r/#/c/415319/ [18:10:56] (Merged) jenkins-bot: Deploy Compact Language Links out of Beta on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/412696 (https://phabricator.wikimedia.org/T187677) (owner: KartikMistry) [18:11:11] (CR) jenkins-bot: Deploy Compact Language Links out of Beta on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/412696 (https://phabricator.wikimedia.org/T187677) (owner: KartikMistry) [18:11:16] Niharika: yeah… I am almost certain it’s unrelated, but I couldn’t figure out why they fail [18:11:30] (PS29) Andrew Bogott: labweb: include mediawiki profiles [puppet] - https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [18:11:32] (PS1) Andrew Bogott: mediawiki: move ::hhvm::admin include into mediawiki::web [puppet] - https://gerrit.wikimedia.org/r/415353 [18:11:53] there’s still a lot of jobs in Zuul, so I don’t want to submit an empty change just to see if the test failure also happens on other Wikibase changes [18:12:22] feel free to skip the change – the bug it fixes can’t be triggered right now as far as I’m aware [18:12:22] kart_: aharoni: It's on mwdebug1002. Test! :) [18:12:37] !log force a puppet run on failed hosts in eqiad for recovery [18:12:38] testing!
[18:12:40] so it’s more a defense in depth / preparation for future config change thing [18:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:59] Lucas_WMDE: Alright, I'll skip it for now just to be safe. We should merge it when the tests have been fixed. [18:13:11] okay, thanks [18:13:20] Niharika: looks good to me. wait till aharoni acks. [18:13:31] * Niharika waits [18:14:12] logged-out: tested! [18:14:19] now testing logged-in... [18:14:40] Ooh it also works for anons? That's nice. [18:14:51] Niharika: that's the whole point :) [18:14:55] (PS2) Niharika29: Enable Wikibase RC injection for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/415078 (owner: Ladsgroup) [18:15:06] for logged-in it has been a beta feature for over three years [18:15:12] (CR) Niharika29: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/415078 (owner: Ladsgroup) [18:15:16] now it will be available for hundreds of millions of anons! [18:15:26] Wooo! [18:15:51] aharoni: I had both things opened, so was able to test quickly ;) [18:16:23] (Merged) jenkins-bot: Enable Wikibase RC injection for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/415078 (owner: Ladsgroup) [18:16:30] 1 out of 4 tested - works [18:16:32] Niharika: you think we can update the interwiki map once you've finished with that patch? [18:16:47] aharoni: Ready to sync? [18:16:47] Operations, Datasets-General-or-Unknown, User-ArielGlenn: Reboots of dumps/snapshot hosts - https://phabricator.wikimedia.org/T188242#4011234 (ArielGlenn) [18:16:55] 2 out of 4 tested - works [18:17:10] Hauskatze: If you can tell me how, sure! I haven't done that before. [18:17:10] 3 out of 4 tested - works [18:17:20] !log gerrit2001 - reboot for kernel upgrade [18:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:45] 4 out of 4 tested - works [18:18:02] Alright then, syncing it now!
[18:18:08] Niharika kart_ - ready to go!!! [18:18:24] aharoni: cool. Good testing! :) [18:19:08] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Deploy Compact Language Links out of Beta on English Wikipedia T187677 (duration: 00m 58s) [18:19:09] (I created four testing user accounts with different settings. All work correctly.) [18:19:19] aharoni: kart_ And it's out! [18:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:22] T187677: Deploy Compact Language Links on the English Wikipedia - https://phabricator.wikimedia.org/T187677 [18:19:57] Amir1: https://gerrit.wikimedia.org/r/#/c/415078/ is on mwdebug1002 [18:19:58] Operations, Datasets-General-or-Unknown, User-ArielGlenn: Reboots of dumps/snapshot hosts - https://phabricator.wikimedia.org/T188242#4011251 (ArielGlenn) snapshot1007 and dumpsdata1001 can't be rebooted right now due to the weekly wikidata nt dumps. And tomorrow the full xml/sql run starts, so we'll... [18:20:15] Niharika: none of them are testable :D [18:20:32] Amir1: Nothing will break, right? :P [18:20:32] Niharika: sure, ping me when you're finished [18:20:49] I'm rather twitchy about deploying stuff lately. [18:20:54] Hauskatze: Will do. [18:21:09] Thanks Niharika and Amir! [18:21:10] (CR) jenkins-bot: Enable Wikibase RC injection for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/415078 (owner: Ladsgroup) [18:21:25] aharoni I mean :) [18:21:27] Thanks kart_.
:) [18:22:31] Niharika: nope [18:22:36] (PS2) Niharika29: Enable reading from full term entity id everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/415080 (https://phabricator.wikimedia.org/T114903) (owner: Ladsgroup) [18:22:43] (CR) Niharika29: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/415080 (https://phabricator.wikimedia.org/T114903) (owner: Ladsgroup) [18:22:50] Niharika: I mean it can break but the good thing is that it will take days to show up [18:23:09] !log niharika29@tin Synchronized wmf-config/Wikibase-production.php: Enable Wikibase RC injection for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/415078 (duration: 00m 57s) [18:23:22] and I'm monitoring it closely [18:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:25] Amir1: That is an excellent feature. [18:23:34] Amir1: First one deployed. [18:23:55] (Merged) jenkins-bot: Enable reading from full term entity id everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/415080 (https://phabricator.wikimedia.org/T114903) (owner: Ladsgroup) [18:23:57] Meta seems a little slow right now. [18:24:01] Might just be me. [18:24:12] (PS3) Niharika29: Reduce the batch size of statment usage tracking to 33 [mediawiki-config] - https://gerrit.wikimedia.org/r/415264 (https://phabricator.wikimedia.org/T151717) (owner: Ladsgroup) [18:24:17] <_joe_> tzatziki: define slow :) [18:24:19] (CR) Niharika29: [C: 2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/415264 (https://phabricator.wikimedia.org/T151717) (owner: Ladsgroup) [18:24:20] <_joe_> what is slow? [18:24:27] _joe_: "took too long to respond" slow [18:24:27] <_joe_> loading pages? saving edits?
[18:24:30] Loading pages [18:24:31] sorry [18:24:34] <_joe_> ok :) [18:24:48] (Am in the middle of other work / meeting atm, just wanted to flag in case it was an issue) [18:24:52] might just be me :) [18:25:00] <_joe_> tzatziki: I can't reproduce :) [18:25:32] So yeah, just me :D [18:25:32] Hmm, it's extremely slow for me, too. [18:25:32] (Merged) jenkins-bot: Reduce the batch size of statment usage tracking to 33 [mediawiki-config] - https://gerrit.wikimedia.org/r/415264 (https://phabricator.wikimedia.org/T151717) (owner: Ladsgroup) [18:25:37] oh [18:25:37] _joe_: I can reproduce what tzatziki is saying. It's very very slow. [18:25:41] Even english. [18:25:54] i tried enwiki, dewiki wikivoyage [18:26:04] <_joe_> so everything across different shards is slow [18:26:09] Where are you bearND? geographically? [18:26:14] (CR) jenkins-bot: Enable reading from full term entity id everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/415080 (https://phabricator.wikimedia.org/T114903) (owner: Ladsgroup) [18:26:17] (CR) jenkins-bot: Reduce the batch size of statment usage tracking to 33 [mediawiki-config] - https://gerrit.wikimedia.org/r/415264 (https://phabricator.wikimedia.org/T151717) (owner: Ladsgroup) [18:26:20] Seems normal speed for me from the UK [18:26:29] central US [18:26:36] It's really fast, from Germany [18:26:50] it might be ulsfo [18:26:53] <_joe_> from EU in general [18:27:08] <_joe_> bearND: can you tell me what IP you get for 'en.wikipedia.org'? [18:27:18] https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [18:27:45] <_joe_> tzatziki: where are you located, broadly?
[18:27:56] _joe_: San Francisco [18:28:00] https://www.irccloud.com/pastebin/Ti6XvkKG/ [18:28:20] <_joe_> ok, I'm pretty sure this is something with local transport OR ulsfo then [18:28:25] pybal on lvs4006 had issues querying dns4002.wikimedia.org, perhaps something wrong in ulsfo [18:28:29] i can use meta wiki from over on US west coast [18:28:41] i mean, pages load normal [18:28:53] jouncebot: next [18:28:54] In 0 hour(s) and 31 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T1900) [18:29:02] It's better now. [18:29:09] <_joe_> nah bearND is via codfw [18:29:36] something deployed at :23 ? [18:29:40] slowdown since then [18:29:41] no_justification: FYI, I'm in the middle of SWAT. [18:29:47] Was wondering [18:29:47] is phabricator.wikimedia.org equally-slow for reporters? [18:29:50] <_joe_> jynus: where do you get that from? [18:29:58] <_joe_> slowdown since then, I mean [18:29:58] Niharika: congratulations, and thanks! [18:29:59] (as a non-wiki, it separates some concerns) [18:30:01] (CR) Andrew Bogott: [C: -1] "This produces a diff on the jobrunners" [puppet] - https://gerrit.wikimedia.org/r/415353 (owner: Andrew Bogott) [18:30:05] phab is fast for me [18:30:10] aharoni: Thanks for all your work! :) [18:30:14] <_joe_> bearND: but the wikis are as well? [18:30:16] jynus: 18:23:09 +logmsgbot | !log niharika29@tin Synchronized wmf-config/Wikibase-production.php: Enable Wikibase RC injection for ruwiki [mediawiki-config] [18:30:17] maybe :21 or :22 [18:30:22] !log niharika29@tin Synchronized wmf-config/Wikibase-production.php: Enable reading from full term entity id everywhere T114903 (duration: 00m 57s) [18:30:26] shoudl we revert that? [18:30:32] shall we stop SWAT? [18:30:33] puppet should be recovered everywhere now [18:30:34] The wikis seem better to me now.
[18:30:34] I'd say yes [18:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:35] <_joe_> jynus: what slowdown are you referring to? [18:30:35] T114903: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903 [18:30:36] oh [18:30:39] <_joe_> greg-g: yes, stop swat [18:30:44] _joe_: actually now the prod sites are fast again [18:30:48] greg-g: That patch is unrelated. [18:30:49] Niharika: stop swatting for a bit :) [18:30:51] kk [18:30:54] Alright. [18:30:55] Niharika: I have to sign off IRC, but ping me on Hangout if anything breaks. [18:30:56] is icinga-wm muted from the puppet-level issues? [18:31:01] <_joe_> ok, I'm pretty sure that has nothing to do with the problem [18:31:02] or normal? [18:31:04] aharoni: Cool, will do. [18:31:05] * greg-g is in a 1:1 and trying to get caught up [18:31:05] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:31:06] <_joe_> bblack: it's back now AFAICS [18:31:26] <_joe_> (puppet) [18:31:30] Will repeat what I said above - it's back to being fast as usual for me. [18:31:32] yes, it is back up again [18:31:36] <_joe_> tzatziki: still experiencing slowness? [18:31:42] it's back, but what? [18:31:48] <_joe_> jynus: where do you get the slowness information? [18:31:54] _joe_: seems better now, on my end [18:31:54] _joe_: I am using mysql trafic stats as a way to evaluate traffic slowdown [18:31:56] Tyler and I will go back to our 1:1, ping us if you need us [18:32:16] _joe_: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1519821127851&to=1519842727851&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [18:32:18] <_joe_> jynus: oh ok, so I'm pretty positive it has nothing to do with a deployment [18:32:24] Niharika: any more swat items? 
we might want to just hold tight until we figure out what happened [18:32:24] <_joe_> but with a traffic issue [18:32:24] !log puppet reenable on einsteinium [18:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:56] there was a 10% less requests between now and 10 minutes ago [18:33:05] that is my only analysis [18:33:10] greg-g: Yes, 3 more. One of them on tin already. [18:33:22] _joe_: concur with holding swat until you figure out what happened? [18:33:31] <_joe_> jynus: yes, that coincides with my assessment it's a traffic-level issue [18:33:59] <_joe_> greg-g: yes, but I'd ask bblack if he's looking into something right now [18:34:05] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [18:34:06] the overall traffic stats don't show a big drop though (but most of those are hits...) [18:34:11] sorry to confuse you- I tried to use performance metrics, but they were unreliable [18:34:38] <_joe_> < icinga-wm> RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 309 (alerts on 19) [18:34:41] I use mysql metrics as an "aggregated metrics stats" [18:34:51] <_joe_> ook, that seems to confirm we had issues there :) [18:35:01] <_joe_> jynus: they're a good indicator indeed [18:35:12] <_joe_> if less queries reach mysql, something is wrong [18:35:22] i more queries reach mysql, something is also wrong [18:35:23] :D [18:35:32] <_joe_> if too many more, indeed! [18:35:34] may be wrong* [18:35:43] "aggregated backend stats" I meant [18:35:50] <_joe_> bblack: I think we can resume SWAT, though [18:35:53] <_joe_> what do you think? [18:36:09] I think we might've had two unrelated issues. [18:36:17] 2? [18:36:17] are things back to normal for reporters? [18:36:23] The traffic thing + db increase [18:36:28] I see what Krinkle is getting at. [18:36:32] db increase? 
[18:36:34] <_joe_> bblack: yes [18:36:37] which shard? [18:36:49] *section [18:37:12] the traffic graphs look like an event we already recovered from too, but I can't speak to the other side of our possible "two issues" above [18:37:18] I don't have any graphs in front of me, I'm mobile, I'm just putting together what you + Krinkle said + swat config changes to RC/Wikibase [18:37:26] s8 [18:37:27] Ignore me [18:37:29] let me look [18:37:45] <_joe_> I think no_justification read the joke Krinkle made as an assertion [18:37:48] <_joe_> :P [18:37:54] ah [18:38:01] I haven't looked at anything other than reading the deploy log (RC injection enabled for wikidata) and the fact people seem to be in the sense of troubles happening. I know that feature was explicitly turned off for db perf reasons earlier this year. [18:38:04] <_joe_> the real issue is extreme slowness reported by users [18:38:05] what I was saying [18:38:12] was bblack issue [18:38:18] So unless there is substantial indication they have fixed that, I cannot possibly be re-enabled without question. [18:38:25] But... for all I knw there is such indication, I haven't looked. [18:38:25] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 9 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:38:38] objetiviced trough mysql metrics, I did not see anything else [18:38:49] * no_justification missed the smiley after Krinkle [18:38:58] :) sorry for the confusion [18:39:04] <_joe_> eheh [18:39:06] <_joe_> ok [18:39:10] Joke -------/ Me \---------> [18:39:15] <_joe_> Niharika: IMO, swat can resume [18:39:34] _joe_: Okay. 
[18:42:42] !log niharika29@tin Synchronized wmf-config/Wikibase.php: Reduce the batch size of statment usage tracking to 33 T151717 (duration: 00m 57s) [18:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:58] T151717: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717 [18:43:13] when I see "RC injection enabled for wikidata" it makes me thing we've intentionally enabled an exploit btw, because I naturally interpret that like: https://www.acronymfinder.com/Remote-Code-Injection-(RCI).html [18:44:00] I suppose it's recent changes? [18:44:05] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Reduce the batch size of statment usage tracking to 33 T151717 (duration: 00m 57s) [18:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:24] Amir1: All of your changes are deployed now. [18:44:32] Niharika: thank you very much [18:44:50] Urbanecm: Around? [18:44:59] Nikerabbit, yep [18:45:17] Sorry, I thought that SWAT is from 19:00 to 20:00 UTC... [18:46:23] jouncebot: next [18:46:23] In 0 hour(s) and 13 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T1900) [18:46:28] jouncebot: now [18:46:28] For the next 0 hour(s) and 13 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T1800) [18:47:09] (PS2) Niharika29: Clean obsolete throttle requests [mediawiki-config] - https://gerrit.wikimedia.org/r/415344 (owner: Urbanecm) [18:47:17] (CR) Niharika29: [C: 2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/415344 (owner: Urbanecm) [18:47:35] Hauskatze: I'll be done in 5 minutes. Can update interwiki then.
[18:48:34] (Merged) jenkins-bot: Clean obsolete throttle requests [mediawiki-config] - https://gerrit.wikimedia.org/r/415344 (owner: Urbanecm) [18:48:39] Urbanecm: np :) [18:48:39] Niharika: I resign my request. I'd love to do another update as well but I cannot figure out how to get a file from terbium to my local folder. [18:48:46] (CR) jenkins-bot: Clean obsolete throttle requests [mediawiki-config] - https://gerrit.wikimedia.org/r/415344 (owner: Urbanecm) [18:49:04] *terbium: dpl-tin * local folder: local download folder [18:49:16] Hauskatze, what about scp command? [18:49:21] Hauskatze: Open file. Copy contents. Paste locally. :P [18:49:30] Niharika, that's another solution :D [18:49:41] (PS3) Niharika29: New throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/415346 (https://phabricator.wikimedia.org/T188529) (owner: Urbanecm) [18:49:42] But hard to process if your file is binary and have 200 MBs :D [18:49:47] (CR) Niharika29: [C: 2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/415346 (https://phabricator.wikimedia.org/T188529) (owner: Urbanecm) [18:49:58] Niharika: it's a php file, format needs to be preserved [18:50:05] (OT: Hauskatze, you have deploy privs?) [18:50:12] on beta [18:50:36] mostly to clean spam directly and update other stuff [18:50:59] (Merged) jenkins-bot: New throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/415346 (https://phabricator.wikimedia.org/T188529) (owner: Urbanecm) [18:51:25] Hauskatze, as I said, I recommand to have a look at command called "scp". You maybe need to transfer your file to a bastion and then use program WinSCP (if you use windows) to copy from bastion to localhost [18:51:58] I had WinSCP installed but I got rid of it.
I guess I'll have to re-download it :| [18:52:19] there's a 'scp' command too, I'll see the docs [18:52:46] Well, all normal linux systems have scp :D [18:53:24] !log niharika29@tin Synchronized wmf-config/throttle.php: Clean obsolete rules and add a new one - T188529 (duration: 00m 56s) [18:53:31] (CR) jenkins-bot: New throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/415346 (https://phabricator.wikimedia.org/T188529) (owner: Urbanecm) [18:53:35] Hauskatze: to preserve whitespace, you could pipe the file through `base64` server-side and `base64 -d` on your end [18:53:35] Urbanecm: Both patches synced. [18:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:40] T188529: Lift IP cap on en.wiki for account creation for Jewish Museum NYC - Sunday March 11 - https://phabricator.wikimedia.org/T188529 [18:53:42] (not sure if that’s available on Windows…) [18:53:43] Niharika, thx [18:54:00] Lucas_WMDE: hold your horses, still a noob :-) [18:54:07] okay, sorry :D [18:54:20] :) [18:55:19] (CR) Chad: [V: 2 C: 2] Add GO plugin for gerrit [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/413888 (owner: Chad) [18:55:59] !log demon@tin Started deploy [gerrit/gerrit@f16f4a4]: GO plugin [18:56:10] !log demon@tin Finished deploy [gerrit/gerrit@f16f4a4]: GO plugin (duration: 00m 10s) [18:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:47] (PS1) Dzahn: servermon: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415362 [18:59:51] (PS2) Imarlier: [WIP] coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T1900) [19:00:04] No GERRIT
patches in the queue for this window AFAICS. [19:00:16] (CR) jerkins-bot: [V: -1] servermon: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415362 (owner: Dzahn) [19:01:54] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:04:22] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4011439 (Anomie) [19:07:50] (CR) Alexandros Kosiaris: [C: 1] "Aside from the two tab issues jenkins is crying about LGTM" [puppet] - https://gerrit.wikimedia.org/r/415362 (owner: Dzahn) [19:12:44] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:25] PROBLEM - keystone public endoint port 5000 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:17] andrewbogott: are you the Andrew mentioned in /topic? !log seems to be working fine, so perhaps we can set the topic back now [19:16:40] you're right, I'll put it back [19:16:49] thanks! [19:18:04] (PS2) Dzahn: servermon: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415362 [19:21:34] Operations, Ops-Access-Requests: reinstate ezachte's access - https://phabricator.wikimedia.org/T188335#4003953 (Ottomata) Erik emailed me the following: On Wed, Feb 28, 2018 at 12:22 PM, Erik Zachte wrote: > Hi Andrew, can I ask you again to help me with regaining server access?... [19:21:44] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.078 second response time [19:22:25] RECOVERY - keystone public endoint port 5000 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 757 bytes in 0.080 second response time [19:26:13] (CR) Dzahn: [C: 2] "thanks for review. fixed that.
and don't see issues here since it's not sharing the node with other services" [puppet] - https://gerrit.wikimedia.org/r/415362 (owner: Dzahn) [19:29:04] (CR) Chad: [C: 2] wmf-beta-autoupdate: Add --jobs and fix --remote usage [mediawiki-config] - https://gerrit.wikimedia.org/r/415200 (owner: Chad) [19:29:09] !log rolling reboot of elasticsearch / cirrus - codfw completed [19:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:35] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:36] (Merged) jenkins-bot: wmf-beta-autoupdate: Add --jobs and fix --remote usage [mediawiki-config] - https://gerrit.wikimedia.org/r/415200 (owner: Chad) [19:30:50] (CR) jenkins-bot: wmf-beta-autoupdate: Add --jobs and fix --remote usage [mediawiki-config] - https://gerrit.wikimedia.org/r/415200 (owner: Chad) [19:32:10] (CR) Dzahn: [C: 2] "puppet run on netmon1003 - noop" [puppet] - https://gerrit.wikimedia.org/r/415362 (owner: Dzahn) [19:37:47] (PS2) Dzahn: netmon/netbox/smokeping/librenms: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/409210 [19:42:48] (PS3) Dzahn: netmon/netbox/smokeping/librenms: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/409210 [19:48:23] Operations, Analytics, ChangeProp, EventBus, and 4 others: Switch cdnPurge to Kafka - https://phabricator.wikimedia.org/T188540#4011588 (Pchelolo) p:Triage>Normal [19:52:20] !log demon@tin Synchronized README: no-op, forcing co-master sync (duration: 00m 57s) [19:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:09] (PS1) Ppchelko: Disable redis queue for cdnPurge for all but wikipedia, commons and wikidata.
[mediawiki-config] - https://gerrit.wikimedia.org/r/415371 (https://phabricator.wikimedia.org/T188540) [19:55:34] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:55:55] PROBLEM - Check size of conntrack table on mw1310 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [19:56:17] (CR) jerkins-bot: [V: -1] Disable redis queue for cdnPurge for all but wikipedia, commons and wikidata. [mediawiki-config] - https://gerrit.wikimedia.org/r/415371 (https://phabricator.wikimedia.org/T188540) (owner: Ppchelko) [19:57:12] (PS2) Ppchelko: Disable redis queue for cdnPurge for all but wikipedia, commons and wikidata. [mediawiki-config] - https://gerrit.wikimedia.org/r/415371 (https://phabricator.wikimedia.org/T188540) [19:58:54] PROBLEM - Check size of conntrack table on mw1310 is CRITICAL: CRITICAL: nf_conntrack is 93 % full [20:00:04] thcipriani: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS.
[20:00:40] * thcipriani does backport [20:01:07] ^fixing mw1310 [20:01:13] (PS4) Muehlenhoff: Add repository component for tor on stretch [puppet] - https://gerrit.wikimedia.org/r/410910 [20:02:15] PROBLEM - Check size of conntrack table on mw1311 is CRITICAL: CRITICAL: nf_conntrack is 93 % full [20:03:29] (CR) Muehlenhoff: [C: 2] Add repository component for tor on stretch [puppet] - https://gerrit.wikimedia.org/r/410910 (owner: Muehlenhoff) [20:03:55] RECOVERY - Check size of conntrack table on mw1310 is OK: OK: nf_conntrack is 71 % full [20:04:15] RECOVERY - Check size of conntrack table on mw1311 is OK: OK: nf_conntrack is 75 % full [20:05:47] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/10202/netmon1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/409210 (owner: Dzahn) [20:07:05] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Upgrade cache_upload to Varnish 5 - https://phabricator.wikimedia.org/T180433#4011659 (ema) [20:07:08] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Upgrade cache_text to Varnish 5 - https://phabricator.wikimedia.org/T184448#4011657 (ema) Open>Resolved a:ema [20:07:24] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#4011660 (ema) Open>Resolved a:ema [20:11:47] Operations, Puppet: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4011676 (herron) p:Triage>Normal [20:12:03] (PS1) Ottomata: 1.4.1 release with python3 packaging [debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/415378 [20:12:19] (PS4) Dzahn: netmon/netbox/smokeping/librenms: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/409210 [20:12:42] (CR) Ottomata: "Tested in labs, fixes https://github.com/dpkp/kafka-python/pull/828" [debs/python-kafka] (debian) -
https://gerrit.wikimedia.org/r/415378 (owner: Ottomata) [20:12:46] (CR) Ottomata: [V: 2 C: 2] 1.4.1 release with python3 packaging [debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/415378 (owner: Ottomata) [20:15:01] (CR) Dzahn: [C: 2] netmon/netbox/smokeping/librenms: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/409210 (owner: Dzahn) [20:18:45] RECOVERY - MariaDB Slave Lag: s3 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:19:59] dependency cycle .. great puppet .. not caught by compiling it [20:20:50] !log thcipriani@tin Synchronized php-1.31.0-wmf.23/includes/page/WikiPage.php: [[gerrit:415351|WikiPage: Avoid $user variable reuse in doDeleteArticleReal()]] T188479 (duration: 00m 57s) [20:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:05] T188479: Deletion logs on mediawiki.org are not showing the proper user making the deletions - https://phabricator.wikimedia.org/T188479 [20:21:44] Operations, Analytics, ChangeProp, EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4011697 (Pchelolo) [20:26:38] legoktm: if you're around I got the backport for wmf.23 synced out and am about to roll forward to group0 again. If you're around and have time I could roll this out to testwiki only if there's anything you'd like to look at before it hits mediawikiwiki again.
[20:26:50] sure [20:27:13] ok, I'll go ahead and roll it out on testwiki [20:28:58] !log thcipriani@tin rebuilt and synchronized wikiversions files: testwiki to 1.31.0-wmf.23 [20:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:40] (PS1) Dzahn: librenms/netbox: fix dependency cycle apache<->letsencrypt [puppet] - https://gerrit.wikimedia.org/r/415379 [20:30:43] legoktm: ^ testwiki is not on 1.31.0-wmf.23, but this might be brutally slow for a few due to hhvm cache warmup [20:30:56] (PS2) Dzahn: librenms/netbox: fix dependency cycle apache<->letsencrypt [puppet] - https://gerrit.wikimedia.org/r/415379 [20:30:56] not or now? [20:31:07] sorry now [20:31:10] :) [20:31:28] (CR) Dzahn: [C: 2] librenms/netbox: fix dependency cycle apache<->letsencrypt [puppet] - https://gerrit.wikimedia.org/r/415379 (owner: Dzahn) [20:31:31] a very ambiguous typo :) [20:33:31] you really meant brutally slow [20:35:01] icinga-wm behaves suspiciously. i'm gonna restart it one more time [20:35:09] yeah, before broader rollout it is pretty awful. [20:35:34] Operations, Puppet: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4011730 (herron) While the catalog-compiler (T187258) has been useful to test compilation under the new version of puppetdb I haven't found a straightforward way to compi... [20:36:47] (PS1) Dzahn: netmon: fix PHP version, 7.0 not 7 is correct [puppet] - https://gerrit.wikimedia.org/r/415381 [20:37:08] (CR) jerkins-bot: [V: -1] netmon: fix PHP version, 7.0 not 7 is correct [puppet] - https://gerrit.wikimedia.org/r/415381 (owner: Dzahn) [20:37:42] (PS2) Dzahn: netmon: fix PHP version, 7.0 not 7 is correct [puppet] - https://gerrit.wikimedia.org/r/415381 [20:37:44] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[ensure_present_mod_php7] [20:38:31] yes, icinga-wm is more talkative right away [20:38:41] and yes, that will be fixed right now [20:38:48] (CR) Dzahn: [C: 2] netmon: fix PHP version, 7.0 not 7 is correct [puppet] - https://gerrit.wikimedia.org/r/415381 (owner: Dzahn) [20:39:18] (PS1) Herron: add forward/reverse records for new ganeti VM elnath [dns] - https://gerrit.wikimedia.org/r/415382 (https://phabricator.wikimedia.org/T188544) [20:41:04] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:41:49] thcipriani: ok, looks fixed to me [20:42:13] (PS2) Andrew Bogott: mediawiki: move ::hhvm::admin include into mediawiki::web [puppet] - https://gerrit.wikimedia.org/r/415353 [20:42:25] legoktm: awesome. Thank you for the sanity check, and for the quick action yesterday! will push out to group0 now. [20:42:44] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:45:35] Operations, Traffic: Post Varnish 5 migration cleanup - https://phabricator.wikimedia.org/T188545#4011757 (ema) [20:45:37] (PS3) Andrew Bogott: mediawiki: move ::hhvm::admin include out of mediawiki module [puppet] - https://gerrit.wikimedia.org/r/415353 [20:45:44] Operations, Traffic: Post Varnish 5 migration cleanup - https://phabricator.wikimedia.org/T188545#4011773 (ema) p:Triage>Normal [20:46:14] (CR) Herron: [C: 2] add forward/reverse records for new ganeti VM elnath [dns] - https://gerrit.wikimedia.org/r/415382 (https://phabricator.wikimedia.org/T188544) (owner: Herron) [20:47:08] (PS1) Thcipriani: Group0 to 1.31.0-wmf.23 (take II) [mediawiki-config] - https://gerrit.wikimedia.org/r/415383 [20:48:12] (CR) Thcipriani: [C: 2] Group0 to 1.31.0-wmf.23 (take II) [mediawiki-config] - https://gerrit.wikimedia.org/r/415383 (owner: Thcipriani) [20:49:24] 
(Merged) jenkins-bot: Group0 to 1.31.0-wmf.23 (take II) [mediawiki-config] - https://gerrit.wikimedia.org/r/415383 (owner: Thcipriani) [20:49:38] (CR) jenkins-bot: Group0 to 1.31.0-wmf.23 (take II) [mediawiki-config] - https://gerrit.wikimedia.org/r/415383 (owner: Thcipriani) [20:51:43] (PS1) Bstorm: views: Accommodate new comments table with rules and compatibility [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [20:52:23] (PS4) Andrew Bogott: mediawiki: move ::hhvm::admin include out of mediawiki module [puppet] - https://gerrit.wikimedia.org/r/415353 [20:53:07] !log thcipriani@tin rebuilt and synchronized wikiversions files: Group0 (back) to 1.31.0-wmf.23 [20:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:03] (PS2) Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [20:57:00] (PS6) Vgutierrez: Provide an UDP monitor [debs/pybal] - https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) [20:58:18] (CR) Vgutierrez: [C: 2] Provide an UDP monitor [debs/pybal] - https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) (owner: Vgutierrez) [20:58:27] (PS5) Andrew Bogott: mediawiki: move ::hhvm::admin include out of mediawiki module [puppet] - https://gerrit.wikimedia.org/r/415353 [20:58:47] (Merged) jenkins-bot: Provide an UDP monitor [debs/pybal] - https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) (owner: Vgutierrez) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180228T2100). 
[21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:02:21] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler02/10206/" [puppet] - https://gerrit.wikimedia.org/r/415353 (owner: Andrew Bogott) [21:04:54] Operations, Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#4011880 (ema) >>! In T165651#3704112, @BBlack wrote: > I don't think anything has changed since on Google's end. Do we try harder or just accept it? I guess we should try harder. :) The list... [21:13:08] (PS6) Andrew Bogott: mediawiki: move ::hhvm::admin include out of mediawiki module [puppet] - https://gerrit.wikimedia.org/r/415353 [21:13:10] (PS30) Andrew Bogott: labweb: include mediawiki profiles [puppet] - https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [21:14:10] (PS1) Rush: openstack: keystone bootstrap setup for mitaka [puppet] - https://gerrit.wikimedia.org/r/415392 (https://phabricator.wikimedia.org/T188266) [21:17:14] !log arlolra@tin Started deploy [parsoid/deploy@d376a3c]: Updating Parsoid to 1415a2a [21:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:00] !log arlolra@tin Finished deploy [parsoid/deploy@d376a3c]: Updating Parsoid to 1415a2a (duration: 08m 46s) [21:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:19] (PS1) Herron: install_server: add dhcp and netboot entries for ganeti VM elnath [puppet] - https://gerrit.wikimedia.org/r/415452 (https://phabricator.wikimedia.org/T188544) [21:39:06] (PS2) Herron: add dhcp netboot and site.pp entries for ganeti VM elnath [puppet] - https://gerrit.wikimedia.org/r/415452 (https://phabricator.wikimedia.org/T188544) [21:40:21] (CR) Herron: [C: 2] add dhcp netboot and site.pp entries for ganeti VM elnath [puppet] - https://gerrit.wikimedia.org/r/415452 (https://phabricator.wikimedia.org/T188544) (owner: 
Herron) [21:40:27] (PS3) Herron: add dhcp netboot and site.pp entries for ganeti VM elnath [puppet] - https://gerrit.wikimedia.org/r/415452 (https://phabricator.wikimedia.org/T188544) [21:41:16] ok, I'm going to push wmf.23 to group1 to get the train back on track (hopefully) [21:41:33] * thcipriani writes to no one in particular [21:42:57] * greg-g sees [21:45:24] (PS1) Thcipriani: Group1 to 1.31.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/415454 [21:46:38] !log Updated Parsoid to 1415a2a (T58756, T169006) [21:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:54] T58756: Parsoid doesn't give external links class="external free|text" - https://phabricator.wikimedia.org/T58756 [21:46:54] T169006: Correctly redirect in Parsoid /transform/wikitext/to/lint endpoint - https://phabricator.wikimedia.org/T169006 [21:48:06] Operations, Puppet, Patch-For-Review: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4011676 (Joe) >>! In T188544#4011730, @herron wrote: > While the catalog-compiler (T187258) has been useful to test compilation under the new versio... [21:49:05] (PS4) ArielGlenn: restbase dumps in xml format [dumps] - https://gerrit.wikimedia.org/r/413212 [21:49:09] (CR) Thcipriani: [C: 2] Group1 to 1.31.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/415454 (owner: Thcipriani) [21:50:33] (Merged) jenkins-bot: Group1 to 1.31.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/415454 (owner: Thcipriani) [21:52:35] Operations, Puppet, Patch-For-Review: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4012015 (herron) >>! In T188544#4012003, @Joe wrote: > Why not use a separate environment for the newer puppetdbquery version? > > That could allow... 
[21:52:42] (CR) jenkins-bot: Group1 to 1.31.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/415454 (owner: Thcipriani) [21:53:20] !log milimetric@tin Started deploy [analytics/refinery@fdd6c25]: Fix error due to invalid docopts comment [21:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:26] (CR) Imarlier: [WIP] coal: Process from Kafka instead of from ZMQ (3 comments) [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier) [21:54:50] (PS1) Paladox: Use the new Gerrit-Managers group [software/gerrit/plugins/wikimedia] (refs/meta/config) - https://gerrit.wikimedia.org/r/415460 [21:55:01] no_justification ^^ [21:55:23] Oh didn't see your diff yet [21:55:25] (Abandoned) Paladox: Use the new Gerrit-Managers group [software/gerrit/plugins/wikimedia] (refs/meta/config) - https://gerrit.wikimedia.org/r/415460 (owner: Paladox) [21:56:31] !log thcipriani@tin rebuilt and synchronized wikiversions files: Group1 to 1.31.0-wmf.23 [21:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:40] !log milimetric@tin Finished deploy [analytics/refinery@fdd6c25]: Fix error due to invalid docopts comment (duration: 04m 19s) [21:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:58] (PS3) Milimetric: Update the cron command with the new sqoop script [puppet] - https://gerrit.wikimedia.org/r/415217 (https://phabricator.wikimedia.org/T184759) [22:06:39] (CR) Milimetric: [C: 1] "ok, this is ready for merge, I tested it as best as I could with the puppet compiler jenkins thing (very cool tool)" [puppet] - https://gerrit.wikimedia.org/r/415217 (https://phabricator.wikimedia.org/T184759) (owner: Milimetric) [22:09:56] (PS4) Ottomata: Update the cron command with the new sqoop script [puppet] - https://gerrit.wikimedia.org/r/415217 
(https://phabricator.wikimedia.org/T184759) (owner: Milimetric) [22:10:07] (CR) Ottomata: [V: 2 C: 2] Update the cron command with the new sqoop script [puppet] - https://gerrit.wikimedia.org/r/415217 (https://phabricator.wikimedia.org/T184759) (owner: Milimetric) [22:11:04] PROBLEM - High lag on wdqs1005 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:11:33] !log thcipriani@tin rebuilt and synchronized wikiversions files: Group1 back to 1.31.0-wmf.22 T188555 [22:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:47] T188555: Notice: Undefined property: stdClass::$rc_timestamp in /srv/mediawiki/php-1.31.0-wmf.23/includes/specials/SpecialNewpages.php - https://phabricator.wikimedia.org/T188555 [22:12:53] (PS1) Thcipriani: Revert "Group1 to 1.31.0-wmf.23" [mediawiki-config] - https://gerrit.wikimedia.org/r/415464 [22:13:03] (CR) Nuria: [C: 1] Update the cron command with the new sqoop script [puppet] - https://gerrit.wikimedia.org/r/415217 (https://phabricator.wikimedia.org/T184759) (owner: Milimetric) [22:13:14] (CR) Thcipriani: [C: 2] Revert "Group1 to 1.31.0-wmf.23" [mediawiki-config] - https://gerrit.wikimedia.org/r/415464 (owner: Thcipriani) [22:14:28] (Merged) jenkins-bot: Revert "Group1 to 1.31.0-wmf.23" [mediawiki-config] - https://gerrit.wikimedia.org/r/415464 (owner: Thcipriani) [22:15:42] (PS1) Ottomata: Automate installation of spark2 oozie sharelib [puppet] - https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) [22:15:58] (PS2) Ottomata: [WIP] Automate installation of spark2 oozie sharelib [puppet] - https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) [22:16:46] (CR) jerkins-bot: [V: -1] [WIP] Automate installation of spark2 oozie sharelib [puppet] - 
https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) (owner: Ottomata) [22:18:21] (CR) jenkins-bot: Revert "Group1 to 1.31.0-wmf.23" [mediawiki-config] - https://gerrit.wikimedia.org/r/415464 (owner: Thcipriani) [22:24:13] (PS1) BBlack: eqsin: enable numa_networking [puppet] - https://gerrit.wikimedia.org/r/415469 [22:24:45] (CR) BBlack: [C: 2] eqsin: enable numa_networking [puppet] - https://gerrit.wikimedia.org/r/415469 (owner: BBlack) [22:28:49] (PS1) BBlack: eqsin: enable numa_networking [puppet] - https://gerrit.wikimedia.org/r/415471 [22:29:37] (CR) BBlack: [C: 2] eqsin: enable numa_networking [puppet] - https://gerrit.wikimedia.org/r/415471 (owner: BBlack) [22:46:59] Operations, ops-eqiad, fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#4012153 (cwdent) @cmjohnson this server also needs the management pass reset [22:47:06] Operations, ops-eqiad, fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#4012154 (cwdent) @cmjohnson this server also needs the management pass reset [22:47:10] Operations, ops-eqiad, fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#4012155 (cwdent) @cmjohnson this server also needs the management pass reset [22:55:22] RECOVERY - High lag on wdqs1005 is OK: OK: Less than 30.00% above the threshold [600.0] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen