[00:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201112T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [01:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201112T0100). [01:24:31] 10Operations, 10Commons, 10MediaWiki-File-management, 10Wikimedia-production-error: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10Krinkle) >>! In T267668#6617648, @jijiki wrote: > @AntiCompositeNumber the feature we have been working on ha... [01:24:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:26:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:43:21] AntiComposite: do you have a repro for the commons issue and/or other reports? [01:44:36] https://en.wikipedia.org/wiki/File:Allan_Shivers_(1956).png / https://commons.wikimedia.org/wiki/File:Allan_Shivers_(1956).png is currently broken [01:45:34] Hm.. 
[01:45:40] and it remains that way even after several minutes and a purge [01:45:41] that's odd [01:45:48] have you found one that recovered [01:45:54] or do they stay broken [01:46:22] https://en.wikipedia.org/wiki/File:Haus_Concordia.jpg did [01:46:38] as did https://en.wikipedia.org/wiki/File:Profiman.jpg [01:48:13] https://en.wikipedia.org/wiki/File:Allan_Shivers_(1956).png remains broken even after local purge, commons purge, and when using XWD to view it from a different data center with cold Memcached cluster [01:48:21] so not a cache issue for this one at last [01:48:23] least* [01:49:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:50:40] files seem to appear about 24-30 hours after upload [01:52:13] interesting, could you summarise on the task for a few files you know of, how old they were when you first saw them work and/or how long it's been if they are still broken? [01:52:19] I'm trying out something on mwdebug1001 now [01:57:05] ok, so reverting the onhost-memcached config does not make a difference. It should only affect ParserCache anyway, so that didn't really make sense in hindsight. The onhost-memcached does not currently affect most memc traffic, only the ParserCache keys. 
[01:58:55] now trying wmf.14 instead of wmf.16 on mwdebug1001 [02:00:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:53] Bingo, wmf.14 works [02:11:36] alright, that moves https://phabricator.wikimedia.org/rMW70c4255978f853f9cf0f6950da5a1644e756378b onto the top of the list of probable problems [02:15:58] but I do not have enough memcache or filerepo knowledge to know what's going on there [02:18:00] 10Operations, 10Commons, 10MediaWiki-File-management, 10Wikimedia-production-error: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10Krinkle) So, what we're seeing is that some files remain unavailable for many hours even upto 1.5 day and no... [02:18:18] AntiComposite: indeed [02:56:38] 10Operations, 10Commons, 10MediaWiki-File-management, 10Wikimedia-production-error: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10AntiCompositeNumber) Using the script from T253405#6161498, I found 38 files out of the last 500 uploads (7.6... [03:01:33] At https://commons.wikimedia.org/wiki/File:Wikipedia-logo-en.png the old versions of the file are not available, and clicking on the links results in, eg, "File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7f/archive/7/7f/20080610233540%21Wikipedia-logo-en.png" - could this be related AntiComposite Krinkle ? [03:02:03] unlikely [03:02:22] indeed, not related [03:02:39] that would be somewhere in Swift/Thumbor [03:03:13] just goes to show how little I know about the file system [03:05:56] actually, not thumbor. 
probably just Swift [03:07:00] yeah, I think thos files are just actually now there [03:07:16] actually not* [03:08:37] yeah, looking at the deletion log, it looks like they got deleted but no one told MediaWiki [03:10:34] there's a good chance they're *somewhere*, just not publicly [03:17:23] (03PS1) 10Krinkle: Revert "filerepo: clean up shared cache keys to avoid key metrics clutter" [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/640504 (https://phabricator.wikimedia.org/T267668) [03:25:39] 10Operations, 10Commons, 10MediaWiki-File-management, 10Patch-For-Review, 10Wikimedia-production-error: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10Krinkle) > likely culprit: https://gerrit.wikimedia.org/r/c/mediawiki/core/+... [03:26:00] (03CR) 10Krinkle: [C: 03+2] Revert "filerepo: clean up shared cache keys to avoid key metrics clutter" [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/640504 (https://phabricator.wikimedia.org/T267668) (owner: 10Krinkle) [03:29:09] PROBLEM - kartotherian endpoints health on maps2003 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [03:30:49] RECOVERY - kartotherian endpoints health on maps2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [03:48:12] (03Merged) 10jenkins-bot: Revert "filerepo: clean up shared cache keys to avoid key metrics clutter" [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/640504 (https://phabricator.wikimedia.org/T267668) (owner: 10Krinkle) [03:53:44] 10Operations, 10Commons, 10MediaWiki-File-management, 10Patch-For-Review, 10Wikimedia-production-error: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 
(10Krinkle) I'm unable to find anyone online to approve or be deploy buddy. The next perso... [03:53:45] !log krinkle@deploy1001 I've locked scap deployments. See https://phabricator.wikimedia.org/T267668#6619762 [03:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:57] (03PS1) 10Krinkle: Revert "Revert "filerepo: clean up shared cache keys to avoid key metrics clutter"" [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/640505 [04:02:02] (03CR) 10Krinkle: [V: 03+2 C: 03+2] Revert "Revert "filerepo: clean up shared cache keys to avoid key metrics clutter"" [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/640505 (owner: 10Krinkle) [04:03:36] (03PS1) 10Krinkle: Revert "filerepo: clean up shared cache keys to avoid key metrics clutter" [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/640746 (https://phabricator.wikimedia.org/T267668) [04:04:50] !log cleaned up and unlocked [04:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:28] Krinkle just out of curiosity, what exactly does "lock[ing] scap deployments" mean and how is it done, and are the changes deployed? [04:53:39] preventing deploys to avoid conflicts [04:53:57] yeah, but how do you do it? 
[05:00:40] there's a lockfile [05:01:02] if it exists, scap don't scap [05:01:18] yeah, I'm just wondering what the command is to lock [05:05:01] looks like there's a not-super-well-documented `scap lock` [05:05:08] https://doc.wikimedia.org/mw-tools-scap/_modules/scap/main.html#LockManager [05:41:06] krinkle@deploy1001$ touch /run/lock/scap-global-lock [05:41:19] and maybe followed by vim /run/lock/scap-global-lock and entering why [05:41:22] DannyS712: ^ [05:41:49] the next person trying to scap will see that message and scap won't start [05:41:58] they'll also see who created the file [05:42:36] neat [06:58:57] (03Abandoned) 10Gergő Tisza: GrowthExperiments: On testwiki, enable variant C/D for now users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633889 (owner: 10Gergő Tisza) [07:22:44] 10Operations, 10LDAP-Access-Requests: LDAP access for Jan Jaquemot - https://phabricator.wikimedia.org/T267771 (10JanJaquemot) [07:47:35] PROBLEM - MariaDB Replica Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:04:08] <_joe_> checking ^^ [08:20:54] 10Operations, 10Commons, 10MediaWiki-File-management, 10MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), and 2 others: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10Joe) Not many people are around, and most importantly no one with extensive WanC... [08:33:45] 10Operations, 10LDAP-Access-Requests: LDAP access for Jan Jaquemot - https://phabricator.wikimedia.org/T267771 (10Aklapper) 05Open→03Stalled @JanJaquemot: Hi, please see and follow https://phabricator.wikimedia.org/project/profile/1564/ for required information. 
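[editor's aside] The lockfile behaviour described in the 05:00-05:42 exchange (a file whose presence blocks deploys, whose contents give the reason, and whose owner shows who locked) can be sketched roughly as below. This is only an illustration under stated assumptions, not scap's actual code: `check_global_lock` is a hypothetical helper, the demo uses a temporary directory rather than the real `/run/lock/scap-global-lock`, and `pwd` is Unix-only.

```python
import os
import pwd
import tempfile


def check_global_lock(lock_path):
    """Hypothetical helper: return None if no lock file exists,
    otherwise a message naming the file's owner and the stated reason."""
    try:
        st = os.stat(lock_path)
    except FileNotFoundError:
        return None  # no lock, deploy may proceed
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid without a passwd entry
    with open(lock_path) as fh:
        reason = fh.read().strip() or "(no reason given)"
    return f"deployments locked by {owner}: {reason}"


# Demo in a throwaway directory (assumption: stands in for /run/lock).
with tempfile.TemporaryDirectory() as tmp:
    lock = os.path.join(tmp, "scap-global-lock")
    print(check_global_lock(lock))       # no lock yet
    with open(lock, "w") as fh:          # equivalent of touch + writing the reason
        fh.write("See T267668\n")
    print(check_global_lock(lock))       # now reports owner and reason
```

A deploy tool following this pattern would refuse to start whenever the helper returns a message, and print that message to the next deployer.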
[08:39:17] RECOVERY - MariaDB Replica Lag: pc1 on pc2010 is OK: OK slave_sql_lag Replication lag: 46.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:41:01] (03CR) 10Hashar: [C: 03+2] "Timo gave the explanation over night at T267668#6619742 and following comments. I guess the sole reason that has not been deployed immedi" [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/640746 (https://phabricator.wikimedia.org/T267668) (owner: 10Krinkle) [08:47:14] 10Operations, 10LDAP-Access-Requests: LDAP access for Jan Jaquemot - https://phabricator.wikimedia.org/T267771 (10JanJaquemot) 05Stalled→03Open [09:05:07] (03PS8) 10Ladsgroup: [WIP] varnish: Improve wording of the browser security error a bit [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) [09:08:09] (03Merged) 10jenkins-bot: Revert "filerepo: clean up shared cache keys to avoid key metrics clutter" [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/640746 (https://phabricator.wikimedia.org/T267668) (owner: 10Krinkle) [09:12:13] <_joe_> ok, pulling on debug1001 [09:12:25] !log Pulled https://gerrit.wikimedia.org/r/640746 on deployment server for # T267668 [09:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:32] T267668: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 [09:13:05] <_joe_> confirmed fixed [09:13:10] whew! [09:13:20] <_joe_> so look here: https://de.wikipedia.org/wiki/Benutzer_Diskussion:GLavagetto_(WMF) [09:13:52] <_joe_> oh sigh [09:14:02] <_joe_> so whenever the page is seen without the fix [09:14:10] <_joe_> it stays broken until the next edit [09:14:40] null edit is ok? 
[09:14:48] <_joe_> action=purge is ok [09:15:00] ok so a solid workaround at any rate [09:15:20] <_joe_> I now see the images https://de.wikipedia.org/wiki/Benutzer_Diskussion:GLavagetto_(WMF) from an incognito window though [09:15:59] uh [09:16:06] <_joe_> so if I edit the page without the fix [09:16:09] without going to mwdebug1001? [09:16:09] <_joe_> it breaks again [09:16:16] ok [09:16:25] <_joe_> if I do the last edit with the fix, it renders correctly [09:16:33] <_joe_> ok, that's enough to tell me the fix works [09:16:36] I think that's expected [09:16:37] great [09:16:42] let's scap sync it so ? ;] [09:16:45] +1 to roll out [09:17:13] <_joe_> +1 [09:17:29] oh my god [09:17:30] <_joe_> https://test.wikipedia.org/wiki/User_talk:GLavagetto_(WMF) is fixed as well [09:17:31] scap warns me [09:17:50] <_joe_> hashar: about social distancing yourself from our codebase? [09:17:56] wut, is this the scap lock? [09:17:57] <_joe_> very considerate of it [09:18:18] na well [09:18:19] if so, that's timo, exactly for this, you can clean it up [09:18:28] or "it can be cleaned up" I mean [09:18:33] I used scap sync-world, and it warns it will rebuild the l10n cache and I should use sync-file instead [09:18:42] scap sync-file php-1.36.0-wmf.16/includes/filerepo :] [09:18:47] oh uh meh blah [09:18:52] <_joe_> uhhh [09:19:09] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.16/includes/filerepo: Revert "filerepo: clean up shared cache keys to avoid key metrics clutter" - T267668 (duration: 01m 01s) [09:19:15] does sync-file get the dir? 
I haven't been able to keep track of the changes recently [09:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:17] T267668: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 [09:19:26] <_joe_> hashar: sync the test too :P [09:19:28] <_joe_> apergos: yes [09:19:30] so theoretically that is fixed now [09:19:39] <_joe_> lemme check [09:19:42] pages still broken due to the cache would need to be purged [09:19:52] yes but that can be noted on the task at least [09:19:55] or just wait for the cache to expire I guess [09:20:01] as opposed to "purge has no effect" [09:20:07] <_joe_> yep fixed for stuff that's not cached [09:20:34] in the meantime the pc2010 replag fixed itself... [09:20:42] and of course now we have a bunch of logspam caused by the revert :-\ [09:21:08] that patch will have to be fixed up, next week i guess [09:21:13] PHP Warning: Declaration of ForeignDBViaLBRepo::getSharedCacheKey(...$args) should be compatible with LocalRepo::getSharedCacheKey($kClassSuffix, ...$components) [09:21:24] <_joe_> uhm [09:21:29] uh [09:21:39] the revert should have got those [09:22:06] 10Operations, 10Commons, 10MediaWiki-File-management, 10MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), and 2 others: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10Joe) The current situation is: - this should not happen for new images. The rev... [09:22:30] 10Operations, 10Commons, 10MediaWiki-File-management, 10MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), and 2 others: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10hashar) And there is now some method signature mismatches: PHP Warning: Declara... 
[09:23:08] LocalRepo::getSharedCacheKey should just take ...$args now [09:23:11] how is that possible [09:25:09] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/640746/1/includes/filerepo/LocalRepo.php#504 [09:25:26] hashar: any thoughts? [09:25:31] digging [09:26:17] hmm [09:26:35] so [09:26:55] that was a one off spike [09:26:59] ah [09:27:11] cause the deployment is not atomic [09:27:21] so some files copied, other files not there yet, boom [09:27:28] and/or maybe the PHP opcache refreshed the files apart from each other [09:27:41] that's not lovely but I'll live with it in the instance [09:27:54] so even if we send A and B at the almost exact same time, maybe the opcache only refreshes A and then B some seconds later [09:28:06] so some requests landing between the refreshes end up running a mixed state of things [09:28:09] not seeing the logstash spam now? [09:31:11] yeah it is gone [09:31:14] whew ok [09:31:15] it lasted for a few seconds [09:31:31] on one of the servers the warning got emitted for 6 seconds [09:31:44] which is within the PHP opcache refresh window (10 secs iirc) [09:31:44] 10Operations, 10Commons, 10MediaWiki-File-management, 10MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), and 2 others: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10hashar) The `getSharedCacheKey()` methods definitely have `...$args` as argument... [09:31:49] I would suspect that [09:32:03] not that I know anything about how the opcache works [09:32:14] but I imagine it is per file based, and each file has its own refresh timer [09:32:16] well the deploy not being atomic is sufficient as a cause [09:32:26] so that's enough for me to stop worrying basically [09:32:30] so we land A and B roughly at the same time (rsync) [09:33:22] right, I would just... meh. 
I mean, 'make a copy, rsync to the copy, move the copy into place' is at least closer but then we still do not have a php-fpm restart [09:33:23] and if the opcache window is per file based and with a 10 seconds TTL, A might refresh almost immediately while in the worst case B gets refreshed almost 10 seconds later if it got refreshed just before rsync changed it [09:33:29] anyways that is another issue altogether [09:33:34] yeah [09:34:08] atomically deploying would be great to reach eventually :-\ [09:34:17] 💯 [09:35:17] there was (maybe still is) a nice issue with objects serialized in the cache [09:35:25] all right, I'm going to have an eye open on this channel for the next half hour or so but be working mostly in another window [09:35:37] that could potentially be deserialized with a newer version of the class that got serialized [09:35:38] after that half hour I won't even have that eye ;-) [09:35:42] leading to some nice effects :] [09:36:06] _joe_: apergos: thank you for the tests! [09:36:20] thanks for the deploy! 
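[editor's aside] The mixed-state window hypothesised in the 09:27-09:33 exchange can be put into numbers with a small model. This is a sketch of the chat's guess, not of how the PHP opcache is actually implemented: it assumes each cached file is re-read at the first request after its own timer (last check plus a TTL of roughly 10 seconds, per the "iirc" above) expires, that both files change on disk at t=0, and that requests arrive continuously. `mixed_state_window` is a hypothetical name.

```python
def mixed_state_window(last_check_a, last_check_b, ttl=10.0):
    """Files A and B both change on disk at t=0.

    last_check_a / last_check_b are the times (in seconds, <= 0) at which
    the cache last revalidated each file. Under the assumed per-file timer,
    a file's new version is picked up at last_check + ttl.
    Returns (start, end) of the interval in which one file is already new
    while the other is still old.
    """
    refresh_a = max(0.0, last_check_a + ttl)
    refresh_b = max(0.0, last_check_b + ttl)
    return (min(refresh_a, refresh_b), max(refresh_a, refresh_b))


# A was last checked 9.5 s before the sync, B only 0.5 s before it:
start, end = mixed_state_window(-9.5, -0.5)
print(f"mixed state from t={start}s to t={end}s ({end - start}s long)")

# If both timers happen to be aligned there is no mixed window at all:
print(mixed_state_window(-3.0, -3.0))
```

Under these assumptions the window can approach the full TTL in the worst case, which is at least compatible with the roughly 6 seconds of warnings reported on one server.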
[09:53:14] <_joe_> yw [10:07:41] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/636017 (owner: 10PipelineBot) [10:09:46] (03PS2) 10Mvolz: Update zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/636896 [10:11:29] logstash me errors looks quiet, my half hour is up [10:13:04] (03CR) 10Mvolz: [C: 03+2] Update zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/636896 (owner: 10Mvolz) [10:15:28] (03Merged) 10jenkins-bot: Update zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/636896 (owner: 10Mvolz) [10:24:34] (03PS1) 10ArielGlenn: use long ints for rev lengths in revsperpage [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/640806 (https://phabricator.wikimedia.org/T263319) [10:26:48] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) >>! In T93049#6607956, @Quiddity wrote: > One more new example of a duplication. > [[https://commons.wikimedia.org... [10:27:06] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] "This has been tested extensively, including against wikidata stub files from production." [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/640806 (https://phabricator.wikimedia.org/T263319) (owner: 10ArielGlenn) [10:53:44] 10Operations, 10LDAP-Access-Requests: LDAP access for Jan Jaquemot - https://phabricator.wikimedia.org/T267771 (10JanJaquemot) [10:59:19] (03PS2) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640687 (owner: 10PipelineBot) [11:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201112T1100). [11:02:21] !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[11:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:50] !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . [11:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:12] !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [11:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:15] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640687 (owner: 10PipelineBot) [11:22:11] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640687 (owner: 10PipelineBot) [11:27:45] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team-TODO, 10Traffic: Puppet disabled in beta cluster varnish deployment-cache-text06 - https://phabricator.wikimedia.org/T267578 (10hashar) 05Open→03Declined In T267561 , the varnish related packages have been upgraded to the ones for... [11:27:48] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10hashar) [11:30:35] !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [11:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:43] !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [11:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:22] !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[11:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:23] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/636922 (owner: 10PipelineBot) [11:45:30] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/637011 (owner: 10PipelineBot) [11:45:43] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640434 (owner: 10PipelineBot) [11:45:47] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640430 (owner: 10PipelineBot) [11:45:51] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640428 (owner: 10PipelineBot) [11:45:56] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640427 (owner: 10PipelineBot) [11:46:01] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640424 (owner: 10PipelineBot) [11:46:02] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10hashar) modules/profile/manifests/cache/varnish/frontend.pp line 87 invokes `confd::file` which initalizes the confd module. It has: ` name=modules/confd/manifests/ini... [11:46:20] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640686 (owner: 10PipelineBot) [11:46:24] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640685 (owner: 10PipelineBot) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201112T1200). 
[12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:00:17] that’s good because I was about to go for lunch :P [12:01:13] Hehe [12:02:56] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10hashar) I have moved the `confd::srv_dns: deployment-prep.eqiad.wmflabs` setting to a new puppet prefix `deployment-cache` at https://horizon.wikimedia.org/project/pre... [12:14:14] 10Operations, 10ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T267746 (10Peachey88) [12:14:17] 10Operations, 10ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T267748 (10Peachey88) [12:24:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:26:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:31:28] 10Operations, 10LDAP-Access-Requests: LDAP access for Till Mletzko - https://phabricator.wikimedia.org/T267744 (10tmletzko) 05Stalled→03Open [12:32:30] 10Operations, 10LDAP-Access-Requests: LDAP access for Till Mletzko - https://phabricator.wikimedia.org/T267744 (10tmletzko) @Aklapper I updated the request. Thanks. 
[12:40:27] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir3002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [13:09:52] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10ArielGlenn) I was able to get further along by doing things manually as root on one of the instances, deployment-cache-text06. I looked at the systemd unit file, /li... [13:21:17] (03PS1) 10Muehlenhoff: Add DannyS712 to cn=nda [puppet] - 10https://gerrit.wikimedia.org/r/640810 (https://phabricator.wikimedia.org/T256367) [13:22:52] 10Operations: wmf_auto_restart_{jenkins,rsync} failing on releases2002 - https://phabricator.wikimedia.org/T267795 (10Joe) [13:23:05] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:51] (03CR) 10Muehlenhoff: [C: 03+2] Add DannyS712 to cn=nda [puppet] - 10https://gerrit.wikimedia.org/r/640810 (https://phabricator.wikimedia.org/T256367) (owner: 10Muehlenhoff) [13:26:43] 10Operations, 10ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T267748 (10Joe) p:05Triage→03High [13:27:22] ACKNOWLEDGEMENT - Device not healthy -SMART- on ms-be2031 is CRITICAL: cluster=swift device=None instance=ms-be2031 job=node site=codfw Giuseppe Lavagetto T267748 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2031&var-datasource=codfw+prometheus/ops [13:38:01] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session updateVarDumps at mwmaint1002 (wiki=cswiki; T246539) [13:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:38:09] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [13:53:01] (03CR) 10Urbanecm: [C: 03+1] "LGTM, thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) (owner: 10Jberkel) [14:00:34] (03PS1) 10Urbanecm: Add artsdatabanken.no to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640813 (https://phabricator.wikimedia.org/T267784) [14:28:48] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10AlexisJazz) [14:30:18] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10hashar) Ariel pointed /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_re2.so is the wrong version. On this task or another related one, I remembered someone marke... [14:31:36] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10hashar) ` name="$ sudo systemctl status varnish-frontend" ● varnish-frontend.service - varnish-frontend (Varnish HTTP Accelerator) Loaded: loaded (/lib/systemd/syst... [14:32:46] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10AlexisJazz) Hey it works now! 
[14:39:03] (03PS1) 10Zoranzoki21: Regenerate Bengali Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640816 (https://phabricator.wikimedia.org/T265553) [14:42:18] (03PS1) 10Giuseppe Lavagetto: docker-reporter: exclude wikispeech images [puppet] - 10https://gerrit.wikimedia.org/r/640817 [14:45:05] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-Ryasmeen: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10hashar) 05Open→03Resolved I have enabled puppet again and ran it. Something I spotted: ` --- /lib/systemd/system/confd.service 2020-11-06 20:04:... [14:48:05] 10Operations, 10User-DannyS712: Access to security IRC channels for DannyS712 - https://phabricator.wikimedia.org/T267800 (10DannyS712) [14:54:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:56:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:00:44] 10Operations, 10User-DannyS712: Access to security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10DannyS712) [15:40:09] (03PS1) 10ArielGlenn: version 0.0.11 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/640831 [15:45:56] 10Operations, 10User-DannyS712: Access to security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10Dsharpe) 05Open→03Resolved a:03Dsharpe Done. 
[15:48:49] (03CR) 10ArielGlenn: [C: 03+2] version 0.0.11 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/640831 (owner: 10ArielGlenn) [15:49:58] 10Operations, 10User-DannyS712: Access to security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10Dsharpe) 05Resolved→03Open Maybe I should leave this open to make sure you have access to the IRC channel that you intended. I granted access for you to #wikimedia-security. That... [15:54:17] 10Operations, 10User-DannyS712: Access to security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10Urbanecm) [15:55:16] 10Operations, 10User-DannyS712: Access to security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10Urbanecm) a:05Dsharpe→03None [15:58:46] 10Operations, 10Traffic: Varnish 503 errors on page with large number of flag icons. - https://phabricator.wikimedia.org/T267804 (10Bugreporter) [16:11:53] !log End of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session updateVarDumps at mwmaint1002 (wiki=cswiki; T246539) [16:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:02] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [16:12:23] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux at mwmaint1002 (wiki=jawiki; T246539) [16:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:39] 10Operations, 10Commons, 10MediaWiki-File-management, 10MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), and 2 others: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10hashar) 05Open→03Resolved a:03Krinkle 
[16:41:54] PROBLEM - Check systemd state on logstash2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:08] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f38b443e4a8: Failed to establish a new connection: [Errno 111] Connection [16:42:08] ://wikitech.wikimedia.org/wiki/Search%23Administration [16:43:15] <_joe_> uhm I would say es on that server crashed [16:44:02] <_joe_> yes [16:44:48] RECOVERY - Check systemd state on logstash2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:04] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2005 is OK: OK - elasticsearch status production-logstash-codfw: delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, number_of_data_nodes: 3, number_of_nodes: 6, unassigned_shards: 0, status: green, cluster_name: production-logstash-codfw, active_shards_percent_as_number: 100.0, initializing_shards: 0, active_primary_shards: 4 [16:45:04] lse, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, active_shards: 865 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:46:27] yes I just restarted it, but looking into why it crashed [16:47:16] <_joe_> herron: hah I didn't expect you'd be around :) [16:47:35] 10Operations, 10User-DannyS712: Access to security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10DannyS712) I can join `#wikimedia-security`, but not `#mediawiki_security` [16:47:38] well hello! 
[16:48:56] ah, kernel oom killer came around [16:49:38] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker-reporter: exclude wikispeech images [puppet] - 10https://gerrit.wikimedia.org/r/640817 (owner: 10Giuseppe Lavagetto) [16:51:24] <_joe_> oh sigh releases2002 [16:55:03] 10Operations, 10Traffic: Varnish 503 errors on page with large number of flag icons. - https://phabricator.wikimedia.org/T267804 (10Joe) p:05Triage→03High [16:58:43] 10Operations, 10Traffic: Varnish 503 errors on page with large number of flag icons. - https://phabricator.wikimedia.org/T267804 (10Joe) I strongly doubt the problem happens at the traffic layer. This seems to be a different kind of problem - maybe those pages once edited overflow some specific limit. A null... [17:00:04] jbond42 and cdanis: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201112T1700). [17:59:48] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 4 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Guess when changes got merged: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&o... [18:00:05] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201112T1800). [19:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201112T1900).
[19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:05:34] (03CR) 10Urbanecm: [C: 03+2] Add artsdatabanken.no to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640813 (https://phabricator.wikimedia.org/T267784) (owner: 10Urbanecm) [19:06:26] (03Merged) 10jenkins-bot: Add artsdatabanken.no to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640813 (https://phabricator.wikimedia.org/T267784) (owner: 10Urbanecm) [19:08:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3ce18e6f63abe060c05c40239b651086f65a1a33: Add artsdatabanken.no to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T267784) (duration: 01m 00s) [19:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:23] T267784: Add artsdatabanken.no to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T267784 [19:08:39] (03PS5) 10Urbanecm: Enable "Cite" button in toolbar for enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) (owner: 10Jberkel) [19:08:45] (03CR) 10Urbanecm: [C: 03+2] Enable "Cite" button in toolbar for enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) (owner: 10Jberkel) [19:09:39] (03Merged) 10jenkins-bot: Enable "Cite" button in toolbar for enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) (owner: 10Jberkel) [19:12:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0f0f8397424d4337cdcd61f7acb276d4f0b1facd: Enable "Cite" button in toolbar for enwiktionary (T267504) (duration: 00m 58s) [19:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:25] T267504: Enable Cite button in toolbar for 
English Wiktionary - https://phabricator.wikimedia.org/T267504 [19:12:37] * Urbanecm done [20:12:54] 10Operations, 10ops-codfw, 10DC-Ops: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10wiki_willy) Specs for Eaton PDU attached: No Master/Expansion, but PDUs can be linked together Sample PDU can be sent in 2-3 weeks {F33913382} [20:40:14] 10Operations, 10Commons, 10MediaWiki-File-management, 10MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), and 2 others: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10AntiCompositeNumber) Confirming that 500/500 of the most recent uploads are work... [20:51:46] (03PS1) 10Hashar: gerrit: fix Prometheus excludeMetrics patterns [puppet] - 10https://gerrit.wikimedia.org/r/640850 [20:52:35] (03CR) 10Hashar: "That follows up Iedb84475bdff35c7018b6f35dc3ab5c0a7c0ccce :)" [puppet] - 10https://gerrit.wikimedia.org/r/640850 (owner: 10Hashar) [22:48:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:08] 10Operations, 10Commons, 10MediaWiki-File-management, 10MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), and 2 others: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (10Urbanecm) Thanks @Krinkle and everyone else who was involved in this!