[00:00:39] Amir1: sorry i'll let you deal with the fire. Just keen to preempt other issues before they become a problem. [00:00:42] Amir1: did the table go from test too? [00:00:55] apparently [00:01:07] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks sudo group: encode yet more ldap values as utf8 [puppet] - 10https://gerrit.wikimedia.org/r/586476 (https://phabricator.wikimedia.org/T249494) (owner: 10Andrew Bogott) [00:01:28] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.26/extensions/Wikibase/repo/includes/Store/Sql/DatabaseSchemaUpdater.php: Do not try to drop things when theres no wb_terms table T208425 T249565 cache bust (duration: 01m 01s) [00:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:34] and the beta cluster won't let me even try to add a link [00:05:05] Reedy: Can you adjust the /topic? [00:05:15] I can [00:05:20] If someone can tell me what it was before :P [00:06:27] Reedy has changed topic for #wikimedia-operations from "Up | Log: https://bit.ly/wikitech | Channel logs: https://bit.ly/opsirclog | Ops Clinic Duty: moritzm" [00:06:41] :) [00:14:49] addshore testwikidata still down [00:15:30] I suspect the table needs re-creating there too [00:15:47] Oh! [00:15:57] I saw that testwikidata was having DB issues a few hours ago. [00:16:00] Damn. [00:16:05] Might well have been this. 
:-( [00:16:23] T249533 [00:16:23] T249533: testwikidata wiki is broken with "Cannot access the database" - https://phabricator.wikimedia.org/T249533 [00:17:50] 10Operations, 10ops-codfw, 10DBA: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['backup2002.codfw.wmnet'] ` [00:18:26] Posting here for a faster response - would it make sense to temporarily stop the addition and changing of site links on wikidata, to avoid creating more duplicates? [00:18:36] The following should work as an abuse filter: [00:18:37] page_namespace == 0 &( "wbsetsitelink-add" in summary | "wbsetsitelink-set" in summary ) [00:18:41] DannyS712: No. [00:18:56] (Because the data would drift elsewhere.) [00:19:11] Okay. Thats what I thought, but just crossposting the suggestion from https://phabricator.wikimedia.org/T249565 [00:19:18] * James_F nods. [00:22:52] 10Operations, 10ops-codfw, 10DBA: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10Papaul) @jcrespo the installation is failing at partition disk maybe something wrong with the partman recipe ~~~ ────────────────────────┤ [!!] Partit... [00:26:27] (03CR) 10Zhuyifei1999: [C: 03+1] tools-static: apply SNI name setting to fontcdn as well [puppet] - 10https://gerrit.wikimedia.org/r/586475 (https://phabricator.wikimedia.org/T249558) (owner: 10Bstorm) [00:33:06] PROBLEM - PHP opcache health on mw2332 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:34:53] presumably there should be an announcement about this to editors very soon? 
[00:35:06] sitelinks/infoboxes being broken is very noticeable [00:51:28] RECOVERY - PHP opcache health on mw2332 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:53:57] sorry we didn't a bad enough of a job on GlobalPrefs! hehe https://phabricator.wikimedia.org/T249565#6034809 [00:55:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.147e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [01:05:04] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.26/extensions/Wikibase/repo/includes/Store/Sql/DatabaseSchemaUpdater.php: T208425 T249565 Follow-up a956c655: Only avoid dropping wb_items_per_site so prod can be merged (duration: 00m 58s) [01:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:12] T208425: [EPIC] Kill the wb_terms table - https://phabricator.wikimedia.org/T208425 [01:05:12] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [01:05:25] tgr|away: what sorts of breakages are you noticing? [01:06:25] I'm guessing what you mention is what I described as "Client pages that render accessing data from wikidata may not receive all of the data they would desire (LUA and parser functions)" [01:06:38] addshore: recently edited pages have no sitelinks or wikidata description [01:06:47] or infoboxes etc. 
presumably [01:06:52] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.26/autoload.php: T157651 Remove sql.php from autoloader (duration: 00m 58s) [01:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:57] T157651: sql.php runs LoadExtensionSchemaUpdates - https://phabricator.wikimedia.org/T157651 [01:07:08] James_F: there we have it, sitelinks vanishing [01:07:13] How recent is recently? [01:07:28] tgr: I suspected that might happen but couldn't pin down the code path [01:07:51] Hmm. [01:08:26] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.26/maintenance/: T157651 Remove sql.php from maintenance/ (duration: 00m 58s) [01:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:56] AntiComposite: after 1:00 UTC [01:10:17] So, last 9 minutes. [01:11:04] no, last two hours [01:11:54] it's 01:11 UTC right now [01:12:44] so after 2020-04-06 23:00 UTC? [01:13:17] Since the DB breakage, then. [01:16:08] I mean, I didn't verify how long it goes back, but why would it be otherwise? [01:16:52] (yeah, my bad about the time zones) [01:17:23] Because we were looking and didn't see it earlier. [01:17:32] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [01:17:37] Which we were surprised by. 
[01:17:57] Based on your comment on the phab task, I'd expect to see way more damage in https://commons.wikimedia.org/wiki/Special:RecentChangesLinked?hidebots=1&translations=filter&hidepageedits=1&hidecategorization=1&hideWikibase=1&target=Category%3AUses_of_Wikidata_Infobox&limit=50&days=30&urlversion=2 [01:19:18] a few recently edited pages do have sitelinks; I'd guess those have been added back manually [01:19:39] tgr: https://phabricator.wikimedia.org/T249565#6034918 [01:20:01] every edit on wikidata will refill the current temporary table [01:20:33] There is a maintenance script running from Q1 upwards too, but I doubt it will even reach Q10million by the time we finish restoring the table [01:20:53] other things that might be seen across the sites is at https://phabricator.wikimedia.org/T249565#6034913 [01:21:22] but ultimately, an edit to the relevant wikidata item should fix things until we finish restoring the table [01:21:23] thanks [01:21:26] * addshore is off now [01:21:46] the "may not result in updates on wikidata" lines sound bad, is that something that's easily fixable eventually? [01:22:23] well I guess page moves / deletions for one day is not a staggering amount of events, anyway [01:22:40] we will probably have to write some maint script to re-fire a bunch of jobs, maybe, for all client sites [01:22:53] yeah, it should be a fairly small list to reason with [01:23:07] o/ [01:23:39] AntiComposite: this only affects linked items, not random access by Q-id [01:23:51] so probably not much on commons [01:24:08] {{wikidata infobox}} on categories is done all with sitelinks [01:30:52] not sure what's going on then [01:31:43] infoboxes on other wikis are definitely broken [01:31:51] e.g. https://hu.wikipedia.org/wiki/Igl%C3%B3i_Mih%C3%A1ly [01:32:08] :/ [01:32:45] The worst thing is, it is quite hard to find the wikidata item for https://hu.wikipedia.org/wiki/Igl%C3%B3i_Mih%C3%A1ly without the index to make an edit to fix the issue! 
[01:33:27] I got it, let me try an edit https://www.wikidata.org/wiki/Q973558 [01:34:05] wikidata.org search works, so it's not hard, just cumbersome [01:34:20] tgr: https://hu.wikipedia.org/wiki/Igl%C3%B3i_Mih%C3%A1ly :) [01:34:22] fixed [01:34:46] but, probably not a good solution for the masses of pages, but probably good for the big ones [01:35:04] can we send out a massmessage? [01:35:05] but we will make sure to reject / purge the parser cache for the "bad time" once the data is back [01:35:41] MassMessage to everyone? :P [01:35:43] Sitenotices? [01:37:08] loggedin-only centralnotice, if that's easy to do [01:37:23] but a mass message to technical village pumps would do as well [01:37:50] https://meta.wikimedia.org/wiki/Distribution_list/Technical_Village_Pumps_distribution_list [01:39:17] just telling people to not try to edit sitelinks but find and purge the wikidata item instead [01:39:47] I don't think a purge will do it, needs to be an actual edit [01:40:00] maybe a purge with force links update (or whatever it's called) [01:41:02] null edit or real edit? [01:41:07] addshore indeed, normal purge doesn't do it [01:42:10] tgr: Can't do null edits on Wikidata items. [01:42:31] Well, users can't. I guess you could trigger one via mwmaint1001. 
[02:08:30] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 57 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:09:16] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 57 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:11:54] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 60 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:43:24] tgr|away: addshore Reedy James_F Hi! Just got pinged by the CentralNotice reference above... FWIW it's easy to make a CentralNotice notice only to logged-in users. pls lmk if you need help :) good luck! 
[02:53:30] PROBLEM - PHP opcache health on mw2328 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:04:38] RECOVERY - PHP opcache health on mw2328 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:43:12] PROBLEM - PHP opcache health on mw2333 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:07:12] (03CR) 10Ladsgroup: TEST: entity source, use modern repoDatabase and interwikiPrefix (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586366 (https://phabricator.wikimedia.org/T248664) (owner: 10Addshore) [04:11:51] (03PS1) 10Ladsgroup: Fix database name for repo in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586488 (https://phabricator.wikimedia.org/T249533) [04:15:20] James_F: hey, around for this ^ https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/586488 [04:17:06] Ha. 
[04:19:17] (03PS2) 10Jforrester: Follow-up 4d11e15ed0: Fix database name for repo in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586488 (https://phabricator.wikimedia.org/T249533) (owner: 10Ladsgroup) [04:19:48] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:23:30] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:24:04] RECOVERY - PHP opcache health on mw2333 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:04:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 41 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:07:12] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:09:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:26:50] !log elukey@cumin1001 START - Cookbook sre.wdqs.data-transfer [05:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:31] (03CR) 10Ladsgroup: [C: 03+2] Follow-up 4d11e15ed0: Fix database name for repo in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586488 
(https://phabricator.wikimedia.org/T249533) (owner: 10Ladsgroup) [05:31:54] (03Merged) 10jenkins-bot: Follow-up 4d11e15ed0: Fix database name for repo in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586488 (https://phabricator.wikimedia.org/T249533) (owner: 10Ladsgroup) [05:35:06] the patch fixes the issue in mwdebug1001, moving forward _joe_ [05:35:27] <_joe_> 👍 [05:35:30] 10Operations, 10Traffic, 10good first task: Only retry failed requests for external traffic on cache frontends - https://phabricator.wikimedia.org/T249317 (10ema) >>! In T249317#6034060, @srishakatux wrote: > @ema Hello! As this task is tagged as a #good_first_task, I'm wondering if it can be made clear wher... [05:37:03] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:586488|Fix database name for repo in testwikidata (T249533)]] (duration: 01m 00s) [05:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:08] T249533: testwikidata wiki is broken with "Cannot access the database" - https://phabricator.wikimedia.org/T249533 [05:38:21] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:586488|Fix database name for repo in testwikidata (T249533)]], take II (duration: 00m 58s) [05:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:36] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:45:24] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:50:31] (03PS2) 10Muehlenhoff: admin: add kerberos flag for aklapper [puppet] - 10https://gerrit.wikimedia.org/r/586208 (https://phabricator.wikimedia.org/T248905) (owner: 10Elukey) [05:51:09] (03CR) 10jerkins-bot: [V: 04-1] admin: add kerberos flag 
for aklapper [puppet] - 10https://gerrit.wikimedia.org/r/586208 (https://phabricator.wikimedia.org/T248905) (owner: 10Elukey) [05:52:52] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [05:53:46] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020), 10Patch-For-Review: Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10MoritzMuehlenhoff) @Aklapper I have just created your Kerberos account. You will have received a mail to your wikimedia.or... [05:55:53] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/586208 (https://phabricator.wikimedia.org/T248905) (owner: 10Elukey) [05:57:28] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#6031455, @Gilles wrote: > Are there any other upcoming performance improvements in the p... 
[05:59:23] (03CR) 10Muehlenhoff: [C: 03+2] admin: add kerberos flag for aklapper [puppet] - 10https://gerrit.wikimedia.org/r/586208 (https://phabricator.wikimedia.org/T248905) (owner: 10Elukey) [06:02:14] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10MoritzMuehlenhoff) 05Open→03Stalled [06:03:00] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (10MoritzMuehlenhoff) p:05Triage→03Medium [06:07:51] 10Operations, 10Traffic: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 (10ema) [06:07:57] 10Operations, 10Traffic: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 (10ema) p:05Triage→03High [06:14:12] 10Operations, 10MediaWiki-Cache, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 4 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10ema) We have discussed this during yesterday's #traffic meeting and the current plan to... [06:16:52] 10Operations, 10Traffic, 10Patch-For-Review: varnishd crashes in vbf_stp_condfetch(): cp3057 and cp3061 - https://phabricator.wikimedia.org/T249344 (10ema) 05Open→03Resolved a:03ema 5.1.3-1wm13 deployed. 
[06:19:43] 10Operations, 10Traffic, 10Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10ema) p:05Medium→03High [06:30:54] (03PS1) 10Vgutierrez: ATS: Re-enable parent proxies on ats-tls [puppet] - 10https://gerrit.wikimedia.org/r/586494 (https://phabricator.wikimedia.org/T249335) [06:32:31] !log stopping slave (s3) on db1095 [06:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:14] !log installing ruby2.1 security updates [06:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:16] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/21728/" [puppet] - 10https://gerrit.wikimedia.org/r/586494 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [06:41:14] (03CR) 10Ema: [C: 03+1] ATS: Re-enable parent proxies on ats-tls [puppet] - 10https://gerrit.wikimedia.org/r/586494 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [06:48:46] (03PS1) 10Jcrespo: restore: Add s8 instance to db1095 [puppet] - 10https://gerrit.wikimedia.org/r/587010 (https://phabricator.wikimedia.org/T157651) [06:49:10] (03CR) 10Vgutierrez: [C: 03+2] ATS: Re-enable parent proxies on ats-tls [puppet] - 10https://gerrit.wikimedia.org/r/586494 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [06:50:28] (03CR) 10Marostegui: [C: 03+1] "Space available on disk 2.7TB, which should be enough" [puppet] - 10https://gerrit.wikimedia.org/r/587010 (https://phabricator.wikimedia.org/T157651) (owner: 10Jcrespo) [06:52:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [06:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:57] !log elukey@cumin1001 START - Cookbook sre.wdqs.data-transfer [06:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer 
(exit_code=0) [06:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:58] PROBLEM - WDQS high update lag on wdqs2008 is CRITICAL: 5142 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [07:02:20] !log updating linux-image-4.9.0-11-amd64 where applicable [07:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [07:17:03] (03CR) 10Jcrespo: [C: 03+2] restore: Add s8 instance to db1095 [puppet] - 10https://gerrit.wikimedia.org/r/587010 (https://phabricator.wikimedia.org/T157651) (owner: 10Jcrespo) [07:18:35] the high lag for wdqs2008 is me, just imported the data, new node [07:19:09] !log restarting s3 on db1095 [07:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:58] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10elukey) [07:31:14] 10Operations, 10Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 (10Joe) I am going to get to the bottom of this today. My plan of action is: - convert one parsoid server back to envoy, de... 
[07:31:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [07:31:53] !log enable parent proxies in ats-tls - T249335 [07:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:59] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [07:36:03] (03PS1) 10Giuseppe Lavagetto: parsoid: convert to envoy wtp1025 [puppet] - 10https://gerrit.wikimedia.org/r/587193 [07:39:37] <_joe_> !log depooling wtp1025, used for debugging [07:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:16] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@23495ae]: deploying wdqs 0.3.17 to wdqs2002: T249196 [07:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:22] T249196: Test the impact of the wdqs updater performance by disabling values cleanup - https://phabricator.wikimedia.org/T249196 [07:40:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] parsoid: convert to envoy wtp1025 [puppet] - 10https://gerrit.wikimedia.org/r/587193 (owner: 10Giuseppe Lavagetto) [07:41:44] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@23495ae]: deploying wdqs 0.3.17 to wdqs2002: T249196 (duration: 01m 28s) [07:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:13] RECOVERY - WDQS high update lag on wdqs2008 is OK: (C)3600 ge (W)1200 ge 832.2 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [07:43:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1080 for schema change', diff saved to https://phabricator.wikimedia.org/P10920 and previous config saved to 
/var/cache/conftool/dbconfig/20200407-074321-marostegui.json [07:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:47] !log Deploy schema change on db1080 [07:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:24] PROBLEM - PHP opcache health on mw2329 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:45:26] PROBLEM - Check systemd state on wtp1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:24] (03PS2) 10Marostegui: wmnet: Replace dbproxy1011 with dbproxy1019 [dns] - 10https://gerrit.wikimedia.org/r/586207 (https://phabricator.wikimedia.org/T231520) [07:46:37] (03CR) 10Marostegui: [C: 03+2] wikireplicas_dns: Replace dbproxy1011 with dbproxy1019 [puppet] - 10https://gerrit.wikimedia.org/r/586206 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui) [07:46:55] (03CR) 10Marostegui: [C: 03+2] wmnet: Replace dbproxy1011 with dbproxy1019 [dns] - 10https://gerrit.wikimedia.org/r/586207 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui) [07:47:36] !log Failover dbproxy1011 to dbproxy1019 - T231520) [07:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:41] T231520: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 [07:50:04] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/586372 (https://phabricator.wikimedia.org/T248858) (owner: 10MSantos) [07:50:07] (03PS1) 10Gehel: wdqs: reload nginx when categories are reloaded. [cookbooks] - 10https://gerrit.wikimedia.org/r/587195 [07:51:04] (03CR) 10Elukey: [C: 03+1] wdqs: reload nginx when categories are reloaded. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/587195 (owner: 10Gehel) [07:52:30] (03CR) 10DCausse: [C: 03+1] wdqs: reload nginx when categories are reloaded. [cookbooks] - 10https://gerrit.wikimedia.org/r/587195 (owner: 10Gehel) [07:52:57] <_joe_> !log disabling puppet on mwdebug1002 [07:53:00] 10Operations, 10observability: Make grafana-next.wm.o HTTP 302 redirect to grafana.wm.o - https://phabricator.wikimedia.org/T240048 (10fgiunchedi) 05Open→03Resolved Tentatively resolving, `grafana-next` now is the standby / upgrade host [07:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:03] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10fgiunchedi) [07:53:26] 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) I have changed DNS so dbproxy1019 is now becoming the active proxy for the web service: ` root@tools-sgebastion-07:~# hos... [07:55:16] (03PS1) 10Marostegui: dbproxy1011: Specify it is not active anymore [puppet] - 10https://gerrit.wikimedia.org/r/587197 [07:55:18] (03CR) 10Gehel: [C: 03+2] wdqs: reload nginx when categories are reloaded. [cookbooks] - 10https://gerrit.wikimedia.org/r/587195 (owner: 10Gehel) [07:56:50] (03CR) 10Marostegui: [C: 03+2] dbproxy1011: Specify it is not active anymore [puppet] - 10https://gerrit.wikimedia.org/r/587197 (owner: 10Marostegui) [07:59:46] (03PS1) 10Gehel: wdqs: better exception and threading handling during file transfer. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/587198 [08:03:56] 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) [08:04:37] (03CR) 10jerkins-bot: [V: 04-1] wdqs: better exception and threading handling during file transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/587198 (owner: 10Gehel) [08:04:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1080 after schema change', diff saved to https://phabricator.wikimedia.org/P10921 and previous config saved to /var/cache/conftool/dbconfig/20200407-080443-marostegui.json [08:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 for schema change', diff saved to https://phabricator.wikimedia.org/P10922 and previous config saved to /var/cache/conftool/dbconfig/20200407-080533-marostegui.json [08:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:31] !log Deploy schema change on db1106 (this will generate lag on s1 labs) [08:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:02] PROBLEM - DPKG on wtp1025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:17:48] !log installing php5 security updates [08:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:30] (03CR) 10Dzahn: [C: 03+1] ci: Use docker.io on Buster [puppet] - 10https://gerrit.wikimedia.org/r/586203 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [08:20:04] PROBLEM - mediawiki-installation DSH group on wtp1025 is CRITICAL: Host wtp1025 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:21:12] PROBLEM - PHP opcache health on mw2304 is CRITICAL: CRITICAL: opcache 
cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:24:42] RECOVERY - PHP opcache health on mw2329 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:25:38] (03PS2) 10Gehel: wdqs: better exception and threading handling during file transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/587198 [08:26:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106 after schema change', diff saved to https://phabricator.wikimedia.org/P10923 and previous config saved to /var/cache/conftool/dbconfig/20200407-082607-marostegui.json [08:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [08:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:32] RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:35] !log decom ganeti VM miscweb2001 (stretch) [08:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:04] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10Gilles) @ema have you checked if there is a correlation with Keep-Alive headers? Eg. does restbase reply with a Keep-Alive head... 
[08:31:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [08:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:15] 10Operations, 10serviceops, 10Patch-For-Review: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `miscweb2001.codfw.wmnet` - miscweb2001.codfw.wmnet (**PASS**) - Downtime... [08:33:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [08:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:30] (03CR) 10Nikerabbit: "I did attempt to do that, but if I understand this right, wmgMonologChannels is processed in logging.php, which is loaded before CommonSet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586353 (https://phabricator.wikimedia.org/T165128) (owner: 10Nikerabbit) [08:36:52] 10Operations, 10vm-requests: codfw: 1 VM request for idp staging host - https://phabricator.wikimedia.org/T249594 (10MoritzMuehlenhoff) [08:37:08] 10Operations, 10vm-requests: codfw: 1 VM request for idp staging host - https://phabricator.wikimedia.org/T249594 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [08:37:18] PROBLEM - Check that envoy is running on mwdebug1002 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:34] !log decom ganeti VM miscweb1001 (stretch) - kept backup of old racktables files and db dump in /root/racktables on miscweb1002 (T247648) [08:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:40] T247648: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 [08:37:47] <_joe_> that is me on mwdebug1002 [08:38:16] ack [08:39:08] RECOVERY - Check that envoy is running on mwdebug1002 is OK: OK - 
envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:24] RECOVERY - PHP opcache health on mw2304 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:42:02] (03PS3) 10Dzahn: decom miscweb1001 and miscweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/586370 (https://phabricator.wikimedia.org/T247648) [08:42:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [08:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:54] 10Operations, 10serviceops, 10Patch-For-Review: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `miscweb1001.eqiad.wmnet` - miscweb1001.eqiad.wmnet (**PASS**) - Downtime... [08:44:00] !log enable uRPF loose mode (log only) on cr3-ulsfo v6 uplinks - T244147 [08:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:44] (03CR) 10Dzahn: [C: 03+2] decom miscweb1001 and miscweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/586370 (https://phabricator.wikimedia.org/T247648) (owner: 10Dzahn) [08:45:02] (03CR) 10Dzahn: "decom cookbook finished before doing this" [puppet] - 10https://gerrit.wikimedia.org/r/586370 (https://phabricator.wikimedia.org/T247648) (owner: 10Dzahn) [08:46:18] (03PS2) 10Dzahn: decom miscweb1001 and miscweb2001 [dns] - 10https://gerrit.wikimedia.org/r/586371 (https://phabricator.wikimedia.org/T247648) [08:46:57] !log enable uRPF loose mode (log only) on cr3-ulsfo v4 uplinks - T244147 [08:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:54] (03CR) 10Dzahn: [C: 03+2] decom miscweb1001 and miscweb2001 [dns] - 10https://gerrit.wikimedia.org/r/586371 (https://phabricator.wikimedia.org/T247648) (owner: 10Dzahn) [08:49:24] (03PS1) 10Elukey: Move 
profile::refinery::job::data_check to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/587202 (https://phabricator.wikimedia.org/T249593) [08:49:26] (03PS1) 10Elukey: Add refine failure flag check for Eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/587203 (https://phabricator.wikimedia.org/T240230) [08:50:36] ACKNOWLEDGEMENT - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] Gehel OSM replication disabled https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:50:38] (03CR) 10Volans: [V: 03+2 C: 03+2] "As agreed with Arzhel offline, merging to test the deploy procedure and that the injection works as expected." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/586397 (owner: 10Volans) [08:53:44] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [08:53:48] !log volans@deploy1001 Started deploy [homer/deploy@a03d7cd]: Inject plugins [08:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:39] ^at the beginning of the month, full backups run, which may cause some regular backups to get a bit delayed [08:54:41] got 99 backups and my state is fresh [08:54:48] ah [08:55:13] there is a buffer, but it is difficult to have a good balance between alerting and spamming [08:55:27] e.g. 
I think the hourly backups alert at 4 hours delayed [08:55:52] plus checks only happen every 1 hour [08:57:12] jynus: yea, i don't think it's spamming, good to see recovery [08:57:57] mutante: if interested, this helps: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&fullscreen&panelId=34&from=1586077073434&to=1586249873434&var-dc=eqiad%20prometheus%2Fops&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-srv-gerrit-git [08:58:27] normally it stays within that band [08:58:45] but every 30 days it may get a bit delayed [08:58:47] !log volans@deploy1001 Finished deploy [homer/deploy@a03d7cd]: Inject plugins (duration: 04m 59s) [08:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:08] jynus: *nod*, thank you. makes sense [08:59:26] i see it's every 30 days, yea [08:59:54] when it's time for a full instead of a diff, ack [09:00:11] full for others so we limit concurrency [09:00:36] gerrit full works well: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&from=1585645225963&to=1586250025963&var-dc=eqiad%20prometheus%2Fops&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-srv-gerrit-git&fullscreen&panelId=35 [09:01:04] gerrit happens to be the one we run every hour [09:01:08] so it will complain the first [09:01:22] 10Operations, 10serviceops: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [09:01:52] (03PS1) 10Muehlenhoff: Add DNS entry for idp-test2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/587206 (https://phabricator.wikimedia.org/T249594) [09:02:02] gotcha. cool [09:02:49] 10Operations, 10serviceops: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) 05Open→03Resolved miscweb1001 and miscweb2001 (stretch) have been removed. services have migrated to miscbweb1002 and miscweb2002 on buster. 
[09:02:52] 10Operations, 10Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [09:04:19] 10Operations, 10Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (10jcrespo) [09:04:21] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) [09:04:43] ^do we have a master ticket for buster on databases? [09:04:47] !log testing ATS 8.0.6-1wm6 on cp4026 and cp4032 [09:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:52] sorry, wrong channel [09:05:06] 10Operations: migrate racktables to a buster VM (was: decom racktables?) - https://phabricator.wikimedia.org/T247646 (10Dzahn) racktables has now been migrated to miscweb1002.eqiad.wmnet on buster. The stretch VM has been decom'ed. The racktables version itself has been upgraded to 0.21.4 and the installer/upgr... [09:05:41] 10Operations: migrate racktables to a buster VM (was: decom racktables?) 
- https://phabricator.wikimedia.org/T247646 (10Dzahn) 05Open→03Resolved [09:05:43] 10Operations, 10serviceops: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [09:06:47] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [09:07:48] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10jcrespo) [09:07:53] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [09:10:42] (03PS1) 10Elukey: Revert "Revert "Revert "profile::analytics::refinery::job::refine: exclude TwoColConflictExit""" [puppet] - 10https://gerrit.wikimedia.org/r/587208 [09:15:09] (03CR) 10Elukey: [C: 03+2] Revert "Revert "Revert "profile::analytics::refinery::job::refine: exclude TwoColConflictExit""" [puppet] - 10https://gerrit.wikimedia.org/r/587208 (owner: 10Elukey) [09:16:59] !log volans@deploy1001 Started deploy [homer/deploy@a03d7cd]: Inject plugins (take 2) [09:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:28] !log volans@deploy1001 Finished deploy [homer/deploy@a03d7cd]: Inject plugins (take 2) (duration: 00m 29s) [09:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 for schema change', diff saved to https://phabricator.wikimedia.org/P10924 and previous config saved to /var/cache/conftool/dbconfig/20200407-091847-marostegui.json [09:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:04] !log Deploy schema change on db1134 [09:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:53] (03PS2) 10Volans: Update Homer's src to v0.2.0 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/586398 [09:19:55] (03PS1) 10Volans: Fix plugins injection [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/587209 [09:20:19] 10Operations, 10ops-codfw, 10DBA: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10jcrespo) I know what is happening, backup[12]002 have only 1 disk shelf, so we need to remove references to the second self (which only exists on backup... [09:22:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/587206 (https://phabricator.wikimedia.org/T249594) (owner: 10Muehlenhoff) [09:24:22] (03CR) 10Volans: [V: 03+2 C: 03+2] "Necessary to test the fix with scap" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/587209 (owner: 10Volans) [09:24:40] 10Operations, 10Commons, 10SRE-swift-storage, 10User-fgiunchedi: Big number of uploads from DPLA bot - https://phabricator.wikimedia.org/T248151 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I'd like to understand better how big of a dataset we're talking about for all uploads that @Dominicbm is wo... [09:25:24] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) The aphlict service has been re-enabled on phab1001. The plan is to have ATS (caching layer) talk directly... [09:25:44] 10Operations, 10Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 (10Joe) Ok, I found the culprit: - private wikis set the cookie `forceHTTPS: true` - We proxy to parsoid-php via `http://l... 
[09:26:14] 10Operations, 10serviceops, 10Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 (10Joe) [09:26:24] !log volans@deploy1001 Started deploy [homer/deploy@ac7a818]: Inject plugins (take 3) [09:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:43] 10Operations, 10ops-codfw, 10DBA: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10jcrespo) Actually those are not configured on partman, but a similar issue exists- the disk is references as sdc (third disk) and we need that to be the... [09:26:56] (03CR) 10Dzahn: "envoy is already running for TLS termination and already has the cert, we can also go via that by adding a second listener in /etc/envoy/l" [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [09:27:34] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/586461" [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [09:27:41] 10Operations, 10Performance-Team: Occasional NIC Tx bandwidth saturation for mc1027 - https://phabricator.wikimedia.org/T248962 (10elukey) >>! In T248962#6027084, @aaron wrote: > It stores the serialized naive "top frame" (e.g. headings, paragraphs, template invocation parameters) of the wikitext of pages, as... 
[09:29:27] !log volans@deploy1001 Finished deploy [homer/deploy@ac7a818]: Inject plugins (take 3) (duration: 03m 03s) [09:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:43] (03CR) 10Volans: [V: 03+2 C: 03+2] "Releasing to prod" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/586398 (owner: 10Volans) [09:30:50] (03CR) 10Dzahn: "While i do appreciate the hiera() to lookup() conversion of all phabricator parameters, it is kind of unrelated to configuring a certifica" [puppet] - 10https://gerrit.wikimedia.org/r/586461 (https://phabricator.wikimedia.org/T238593) (owner: 1020after4) [09:31:25] (03PS1) 10Jcrespo: backups: Assume backups have its ssds on sda and sdb for partman [puppet] - 10https://gerrit.wikimedia.org/r/587214 (https://phabricator.wikimedia.org/T248934) [09:31:26] !log volans@deploy1001 Started deploy [homer/deploy@b4522ad]: Release v0.2.0 [09:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:43] !log volans@deploy1001 Finished deploy [homer/deploy@b4522ad]: Release v0.2.0 (duration: 00m 16s) [09:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:35] (03PS2) 10Jcrespo: backups: Assume backups have their ssds on sda and sdb for partman [puppet] - 10https://gerrit.wikimedia.org/r/587214 (https://phabricator.wikimedia.org/T248934) [09:33:50] (03CR) 10Jcrespo: "mmmmh" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587214 (https://phabricator.wikimedia.org/T248934) (owner: 10Jcrespo) [09:36:19] (03CR) 10Hnowlan: [C: 03+1] ChangeProp: add more metrics and deploy the latest code (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/586439 (https://phabricator.wikimedia.org/T248677) (owner: 10Ppchelko) [09:36:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1134 after schema change', diff saved to https://phabricator.wikimedia.org/P10925 and previous config saved to 
/var/cache/conftool/dbconfig/20200407-093638-marostegui.json [09:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 for schema change', diff saved to https://phabricator.wikimedia.org/P10926 and previous config saved to /var/cache/conftool/dbconfig/20200407-093820-marostegui.json [09:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:36] !log Deploy schema change on db1119 [09:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:40] (03CR) 10Dzahn: "will change config (and restart) vcs and phd and not just affect aphlict:" [puppet] - 10https://gerrit.wikimedia.org/r/586461 (https://phabricator.wikimedia.org/T238593) (owner: 1020after4) [09:41:50] (03PS1) 10Volans: Built wheels for v0.2.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/587216 [09:42:44] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21730/phab1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/586461 (https://phabricator.wikimedia.org/T238593) (owner: 1020after4) [09:43:19] (03PS7) 10Dzahn: ATS/phabricator: configure aphlict certificate [puppet] - 10https://gerrit.wikimedia.org/r/586461 (https://phabricator.wikimedia.org/T238593) (owner: 1020after4) [09:47:00] (03CR) 10Ayounsi: [C: 03+1] "Verified that all the new version numbers are greater than the old ones." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/587216 (owner: 10Volans) [09:48:40] (03CR) 10Volans: [V: 03+2 C: 03+2] Built wheels for v0.2.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/587216 (owner: 10Volans) [09:48:58] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entry for idp-test2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/587206 (https://phabricator.wikimedia.org/T249594) (owner: 10Muehlenhoff) [09:49:18] !log volans@deploy1001 Started deploy [homer/deploy@887544c]: Release v0.2.0 (take 2) [09:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:44] !log volans@deploy1001 Finished deploy [homer/deploy@887544c]: Release v0.2.0 (take 2) (duration: 00m 26s) [09:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:44] I am working on kafka-jumbo1009, not a live host, if it says "down" no problem [09:58:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1119 after schema change', diff saved to https://phabricator.wikimedia.org/P10927 and previous config saved to /var/cache/conftool/dbconfig/20200407-095852-marostegui.json [09:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:21] (03CR) 10Hnowlan: [C: 04-1] "the repo index.yaml is lacking the new version - the `helm repo index` should happen after the package has been generated" [deployment-charts] - 10https://gerrit.wikimedia.org/r/586439 (https://phabricator.wikimedia.org/T248677) (owner: 10Ppchelko) [10:02:58] (03CR) 1020after4: "I only converted the `hiera` calls to `lookup` because CI wouldn't let me add any new `hiera` calls and it was weird having a mixture of b" [puppet] - 10https://gerrit.wikimedia.org/r/586461 (https://phabricator.wikimedia.org/T238593) (owner: 1020after4) [10:04:38] (03PS1) 10Hoo man: wikibasedumps-shared: Query using mysql.php, not sql.php [puppet] - 10https://gerrit.wikimedia.org/r/587218 [10:04:48] (03PS1) 10Muehlenhoff: 
Initial version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 [10:12:40] jouncebot: now [10:12:40] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [10:12:53] I'm going to backport https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/587217/ [10:20:06] (03PS1) 1020after4: ATS/phabricator: enable aphlict certificate in hiera. [puppet] - 10https://gerrit.wikimedia.org/r/587224 (https://phabricator.wikimedia.org/T238593) [10:20:39] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:21:35] (03PS2) 10Hnowlan: ChangeProp: add more metrics and deploy the latest code [deployment-charts] - 10https://gerrit.wikimedia.org/r/586439 (https://phabricator.wikimedia.org/T248677) (owner: 10Ppchelko) [10:21:45] PROBLEM - PHP opcache health on mw2305 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:22:15] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22416 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:23:21] (03CR) 10ArielGlenn: "Could you add the --group param so the query runs on the dump dbs? 
I know it's a super fast query but we might as well keep everything on " [puppet] - 10https://gerrit.wikimedia.org/r/587218 (owner: 10Hoo man) [10:27:06] !log starting recovery on db1099:3318 [10:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:28:56] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) So we have just one last remaining issue to deal with: ` Unable to open file ("/etc/ssl/private/phabrica... [10:29:05] (03PS2) 10Hoo man: wikibasedumps-shared: Query using mysql.php, not sql.php [puppet] - 10https://gerrit.wikimedia.org/r/587218 [10:29:07] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:29:35] (03CR) 10MSantos: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/586372 (https://phabricator.wikimedia.org/T248858) (owner: 10MSantos) [10:29:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:31:27] (03PS1) 10Dzahn: phabricator: enable TLS for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587225 (https://phabricator.wikimedia.org/T238593) [10:33:08] (03CR) 10Hnowlan: [C: 03+2] ChangeProp: add more metrics and deploy the latest code [deployment-charts] - 10https://gerrit.wikimedia.org/r/586439 (https://phabricator.wikimedia.org/T248677) (owner: 10Ppchelko) [10:33:31] (03Merged) 10jenkins-bot: ChangeProp: add more metrics and deploy the latest code [deployment-charts] - 10https://gerrit.wikimedia.org/r/586439 (https://phabricator.wikimedia.org/T248677) (owner: 10Ppchelko) [10:35:19] (03CR) 10ArielGlenn: [C: 03+2] "Good to go. Will deploy right after merge." [puppet] - 10https://gerrit.wikimedia.org/r/587218 (owner: 10Hoo man) [10:37:07] (03PS2) 10Dzahn: phabricator: enable TLS for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587225 (https://phabricator.wikimedia.org/T238593) [10:39:54] <_joe_> jouncebot: next [10:39:54] In 0 hour(s) and 20 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200407T1100) [10:39:55] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 22409 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:40:40] * addshore is syncing a maint script update currently [10:41:20] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.26/extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php: T249565 T249596 Wikibase rebuildItemsPerSite.php script that allows lists of ids (duration: 01m 00s) [10:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:28] T249596: Rebuild wb_items_per_site, after incident where wb_items_per_site was 
dropped - https://phabricator.wikimedia.org/T249596 [10:41:28] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [10:44:26] (03CR) 1020after4: [C: 03+1] phabricator: enable TLS for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587225 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [10:45:40] (03Abandoned) 1020after4: ATS/phabricator: enable aphlict certificate in hiera. [puppet] - 10https://gerrit.wikimedia.org/r/587224 (https://phabricator.wikimedia.org/T238593) (owner: 1020after4) [10:45:58] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:50] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: allow adding XFP header, enable on parsoid/restbase [puppet] - 10https://gerrit.wikimedia.org/r/587227 (https://phabricator.wikimedia.org/T249535) [10:47:19] RECOVERY - PHP opcache health on mw2305 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:50:08] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020): Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10Aklapper) Thanks everyone! * CLI: Works; done. ** Running `ssh stat1007` and entering the command `hive` works now (after following the Kerbero... 
[10:52:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/21731/mw1261.eqiad.wmnet/ the change does what's expected of it, but I'll wait for Alex/R" [puppet] - 10https://gerrit.wikimedia.org/r/587227 (https://phabricator.wikimedia.org/T249535) (owner: 10Giuseppe Lavagetto) [11:01:45] (03PS1) 10Addshore: Remove old unused RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587230 (https://phabricator.wikimedia.org/T203888) [11:06:19] (03PS1) 10Addshore: RejectParserCacheValue entries during wb_items_per_site drop incident [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) [11:07:40] !log starting recovery on all s8 hosts [11:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:46] * addshore is going to deploy that first one removing an old hook [11:09:43] !log Deploy schema change on s3 codfw [11:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:11] (03CR) 10Addshore: [C: 04-2] "As there is a TODO in there" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [11:23:09] (03CR) 10JMeybohm: "What do you think about naming the systemd less generic?" 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 (owner: 10Muehlenhoff) [11:30:51] (03CR) 10Jakob: [C: 03+1] Remove old unused RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587230 (https://phabricator.wikimedia.org/T203888) (owner: 10Addshore) [11:31:02] (03CR) 10Addshore: [C: 03+2] Remove old unused RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587230 (https://phabricator.wikimedia.org/T203888) (owner: 10Addshore) [11:31:59] (03Merged) 10jenkins-bot: Remove old unused RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587230 (https://phabricator.wikimedia.org/T203888) (owner: 10Addshore) [11:32:57] !stopped the rebuilt script (T157651) [11:32:57] T157651: sql.php runs LoadExtensionSchemaUpdates - https://phabricator.wikimedia.org/T157651 [11:33:06] Wrong ticket :( [11:34:58] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: cleanup T203888, Remove old unused RejectParserCacheValue hook (duration: 00m 59s) [11:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:04] T203888: Turn on Sense support on Wikidata - https://phabricator.wikimedia.org/T203888 [11:36:02] addshore: Amir1: I will be there this afternoon if you need assistance with the train blockers [11:36:05] !log stopped the rebuilt script (T249565) [11:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:10] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [11:36:15] hashar: Thanks [11:36:32] in roughly an hour I guess. 
Still have to lunch [11:42:11] PSA: We are going to go read only for a bit to bring back the table from a back up [11:42:43] (03PS1) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [11:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1092, db1111, db1099:3318 for table rename', diff saved to https://phabricator.wikimedia.org/P10929 and previous config saved to /var/cache/conftool/dbconfig/20200407-114258-marostegui.json [11:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:42] (03CR) 10Dzahn: "going through envoy so use port 444 or something else -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/587233" [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [11:45:28] !log stopping s8 replication on db1116:3318, db1095:3318, db2079 [11:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:25] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [11:47:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:47:59] (03PS2) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [11:48:15] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:48:43] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:48:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: queens: drop python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/583680 (https://phabricator.wikimedia.org/T242766) (owner: 10Arturo Borrero Gonzalez) [11:49:12] it's recovering [11:50:26] !log renaming wb_items_per_site_recovered to wb_items_per_site on s8 [11:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1092, db1111, db1099:3318 after table rename', diff saved to https://phabricator.wikimedia.org/P10930 and previous config saved to /var/cache/conftool/dbconfig/20200407-115058-marostegui.json [11:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:51:31] yup, looks good [11:51:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:51:54] (03PS3) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [11:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'depool db1126', diff saved to https://phabricator.wikimedia.org/P10931 and previous config saved to /var/cache/conftool/dbconfig/20200407-115154-marostegui.json [11:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:10] it exploded again [11:52:12] (03CR) 10DCausse: wdqs: better exception and threading handling during file transfer. (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/587198 (owner: 10Gehel) [11:52:15] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:52:24] (03PS4) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [11:52:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'repool db1126', diff saved to https://phabricator.wikimedia.org/P10932 and previous config saved to /var/cache/conftool/dbconfig/20200407-115228-marostegui.json [11:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:54:02] fatals have recovered [11:54:51] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:55:48] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [11:59:38] (03CR) 10Tarrow: "Looks good to me; happy to see it merged when the TODO is in." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [12:01:20] (03PS5) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [12:04:02] addshore: Do you have the list of the items changed? [12:04:11] yes [12:04:21] https://phabricator.wikimedia.org/T249596#6035910 [12:04:26] just trying to decide on the batch size now [12:04:42] i staretd runnning and imediatly saw [12:04:45] Saving sitelinks for Item Q18 failed [12:04:45] Saving sitelinks for Item Q336 failed [12:04:55] so was going to quickly investigate a little bit [12:05:05] 10Operations: persistent cronspam from Cron Daemon - https://phabricator.wikimedia.org/T247608 (10Dzahn) 05Open→03Resolved No more of these cron spam mails that I can see since about 4 days. 
[12:05:19] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [12:05:20] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [12:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:26] looks like normal batch size will be fine [12:05:33] I'll just make sure I save the logs somewhere [12:05:45] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:06:28] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php --wiki=wikidatawiki --file T249596-4.list > T249596-4.out # T249565 T249596 [12:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:36] T249596: Rebuild wb_items_per_site, after incident where wb_items_per_site was dropped - https://phabricator.wikimedia.org/T249596 [12:06:36] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [12:06:54] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) The new plan is to do TLS termination in envoy rather than in nodejs itself. Hence the new patch above to... [12:10:47] (03CR) 10WMDE-leszek: "> Given we did this a few times I'd think a time based ParserCache invalidator might be in order."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [12:10:48] 10Operations: Onboarding Janis Meybohm - https://phabricator.wikimedia.org/T249081 (10MoritzMuehlenhoff) [12:14:33] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/586372 (https://phabricator.wikimedia.org/T248858) (owner: 10MSantos) [12:15:14] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/585517 (owner: 10Jbond) [12:18:23] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22409 bytes in 0.521 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:20:09] (03CR) 10Hoo man: RejectParserCacheValue entries during wb_items_per_site drop incident (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [12:24:03] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:25:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [12:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:59] (03CR) 10Gehel: wdqs: better exception and threading handling during file transfer. (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/587198 (owner: 10Gehel) [12:31:12] 10Operations, 10Repository-Admins, 10Traffic: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (10ema) [12:34:13] (03CR) 10DCausse: [C: 03+1] wdqs: better exception and threading handling during file transfer. 
(032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/587198 (owner: 10Gehel) [12:42:06] !log restart ats-tls on cp3058 - T249335 [12:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:12] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [12:42:57] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php --wiki=wikidatawiki --file T249596-5.list > T249596-5.out # T249565 [12:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:06] T249596: Rebuild wb_items_per_site, after incident where wb_items_per_site was dropped - https://phabricator.wikimedia.org/T249596 [12:43:07] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [12:43:55] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22359 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:44:32] (03PS1) 10Elukey: role::druid::public::worker: enable SQL query [puppet] - 10https://gerrit.wikimedia.org/r/587249 [12:46:03] 10Operations: Onboarding Janis Meybohm - https://phabricator.wikimedia.org/T249081 (10JMeybohm) [12:49:12] (03CR) 10Ottomata: [C: 03+1] Add refine failure flag check for Eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/587203 (https://phabricator.wikimedia.org/T240230) (owner: 10Elukey) [12:50:17] (03CR) 10Elukey: [C: 03+2] role::druid::public::worker: enable SQL query [puppet] - 10https://gerrit.wikimedia.org/r/587249 (owner: 10Elukey) [12:50:41] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "After talk with Faidon and Arzhel on IRC, it seems this change is not desirable from prod networking point of view." 
[puppet] - 10https://gerrit.wikimedia.org/r/585031 (https://phabricator.wikimedia.org/T247505) (owner: 10Andrew Bogott) [12:50:45] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php --wiki=wikidatawiki --file T249596-6.list > T249596-6.out # T249565 [12:50:49] 10Operations: Onboarding Janis Meybohm - https://phabricator.wikimedia.org/T249081 (10JMeybohm) [12:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:53] T249596: Rebuild wb_items_per_site, after incident where wb_items_per_site was dropped - https://phabricator.wikimedia.org/T249596 [12:50:53] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [12:52:00] (03CR) 10Elukey: [C: 03+2] Move profile::refinery::job::data_check to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/587202 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey) [12:52:02] (03PS2) 10Elukey: Add refine failure flag check for Eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/587203 (https://phabricator.wikimedia.org/T240230) [12:53:38] (03PS1) 10Muehlenhoff: Add DHCP config for idp-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/587250 [12:53:55] (03CR) 10Gehel: wdqs: better exception and threading handling during file transfer. 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/587198 (owner: 10Gehel) [12:53:57] (03CR) 10Elukey: [C: 03+2] Add refine failure flag check for Eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/587203 (https://phabricator.wikimedia.org/T240230) (owner: 10Elukey) [12:54:52] (03CR) 10Addshore: [C: 04-2] RejectParserCacheValue entries during wb_items_per_site drop incident (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [12:54:54] (03PS4) 10Ottomata: refine - look for schemas both primary and secondary schema repositories [puppet] - 10https://gerrit.wikimedia.org/r/586356 (https://phabricator.wikimedia.org/T240985) [12:56:27] (03CR) 10Ottomata: [V: 03+2 C: 03+2] refine - look for schemas both primary and secondary schema repositories [puppet] - 10https://gerrit.wikimedia.org/r/586356 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [12:57:25] (03CR) 10Hoo man: RejectParserCacheValue entries during wb_items_per_site drop incident (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [12:58:41] oh elukey l puppet-merged your change [12:58:47] for failed flags [12:58:55] i also just merged a refine related change [12:59:03] (03CR) 10Addshore: [C: 04-2] RejectParserCacheValue entries during wb_items_per_site drop incident (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [12:59:18] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom, and 3 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) Taking this off the clinic duty board. This needs system design / strategy. I'm taggin... 
[12:59:57] !log restart ats-tls on cp3052- T249335 [13:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:03] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [13:00:46] ottomata: thanks! [13:01:29] 10Operations, 10MediaWiki-Parser, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Improve PoolCounterWork logic to cover possible raised exceptions - https://phabricator.wikimedia.org/T249531 (10daniel) p:05Triage→03Medium [13:01:59] (03PS6) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [13:04:14] (03PS1) 10Nikerabbit: Enable MassMessage logging on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587251 (https://phabricator.wikimedia.org/T165128) [13:08:46] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP config for idp-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/587250 (owner: 10Muehlenhoff) [13:09:43] (03PS2) 10Addshore: RejectParserCacheValue entries during wb_items_per_site drop incident [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) [13:10:39] (03CR) 10jerkins-bot: [V: 04-1] RejectParserCacheValue entries during wb_items_per_site drop incident [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [13:11:56] (03PS1) 10Filippo Giunchedi: debian: first commit [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/587252 [13:12:36] (03PS2) 10Filippo Giunchedi: debian: first commit [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/587252 (https://phabricator.wikimedia.org/T233956) [13:13:45] 10Operations, 10serviceops, 10Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - 
https://phabricator.wikimedia.org/T249535 (10akosiaris) >>! In T249535#6035576, @Joe wrote: > Ok, I found the culprit: > > - private wikis set the c... [13:14:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 on the idea, got a couple of minor nitpicks but feel free to ignore." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/587227 (https://phabricator.wikimedia.org/T249535) (owner: 10Giuseppe Lavagetto) [13:14:32] (03PS1) 10Elukey: profile::analytics::refinery::job::data_check: set deploy-mode client [puppet] - 10https://gerrit.wikimedia.org/r/587253 (https://phabricator.wikimedia.org/T240230) [13:17:20] !log restart ats-tls on cp3056 - T249335 [13:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:26] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [13:18:40] (03PS2) 10Elukey: profile::analytics::refinery::job::data_check: set deploy-mode client [puppet] - 10https://gerrit.wikimedia.org/r/587253 (https://phabricator.wikimedia.org/T240230) [13:19:02] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery::job::data_check: set deploy-mode client [puppet] - 10https://gerrit.wikimedia.org/r/587253 (https://phabricator.wikimedia.org/T240230) (owner: 10Elukey) [13:21:37] (03PS1) 10Ottomata: Remove now unused mediawiki/event-schemas repo [puppet] - 10https://gerrit.wikimedia.org/r/587255 (https://phabricator.wikimedia.org/T240985) [13:24:07] (03PS3) 10Elukey: profile::analytics::refinery::job::data_check: set deploy-mode client [puppet] - 10https://gerrit.wikimedia.org/r/587253 (https://phabricator.wikimedia.org/T240230) [13:24:49] (03CR) 10Gehel: [C: 03+2] wdqs: better exception and threading handling during file transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/587198 (owner: 10Gehel) [13:26:46] (03CR) 10Hoo man: "Looks good to me." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [13:27:40] (03CR) 10Ottomata: "This is true, but it does mean that the RefineTarget.find call will run locally, which isn't an insignificant amount of work. Probably ok" [puppet] - 10https://gerrit.wikimedia.org/r/587253 (https://phabricator.wikimedia.org/T240230) (owner: 10Elukey) [13:28:18] (03PS3) 10Addshore: RejectParserCacheValue entries during wb_items_per_site drop incident [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) [13:29:21] (03CR) 10Addshore: [C: 04-2] RejectParserCacheValue entries during wb_items_per_site drop incident (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [13:31:01] (03CR) 10Elukey: "> This is true, but it does mean that the RefineTarget.find call will" [puppet] - 10https://gerrit.wikimedia.org/r/587253 (https://phabricator.wikimedia.org/T240230) (owner: 10Elukey) [13:31:06] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1003/21737/" [puppet] - 10https://gerrit.wikimedia.org/r/587255 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [13:31:12] (03PS4) 10Addshore: RejectParserCacheValue entries during wb_items_per_site drop incident [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) [13:31:16] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:49] this is surely me --^ [13:31:53] checking [13:32:09] k [13:32:12] yep, fixing [13:33:57] (03PS1) 10KartikMistry: Enable ContentTranslation in Slovenian WP as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587257 (https://phabricator.wikimedia.org/T248836) [13:38:08] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:40:14] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:23] (03PS1) 10Hnowlan: changeprop: use correct tag for null_edit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/587260 [13:42:34] (03PS1) 10Muehlenhoff: Add site.pp entry for idp-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/587261 [13:49:53] (03CR) 10Jakob: [C: 03+1] RejectParserCacheValue entries during wb_items_per_site drop incident [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [13:49:57] (03CR) 10Addshore: [C: 03+2] RejectParserCacheValue entries during wb_items_per_site drop incident [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [13:50:07] * addshore will be deploying that soon [13:50:12] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Maps (Tilerator), 10Patch-For-Review: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Pchelolo) We're going to remove support for this from #changeprop as well as a part of k8s transition. If there are a... 
[13:51:14] (03Merged) 10jenkins-bot: RejectParserCacheValue entries during wb_items_per_site drop incident [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587231 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [13:51:32] (03CR) 10Muehlenhoff: [C: 03+2] Add site.pp entry for idp-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/587261 (owner: 10Muehlenhoff) [13:55:21] !log addshore@deploy1001 sync-file aborted: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (1h) (duration: 00m 29s) [13:55:25] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020): Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03elukey Superset needs an internal user created, this is handled by the analytics team, reassignin... [13:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:28] T249595: Purge / Reject client pages that were cached in parser cache during the T249565 (wb_items_per_site) incident - https://phabricator.wikimedia.org/T249595 [13:55:28] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [13:55:32] (03PS1) 10Addshore: Revert "RejectParserCacheValue entries during wb_items_per_site drop incident" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587262 [13:55:36] (03CR) 10Addshore: [V: 03+2 C: 03+2] Revert "RejectParserCacheValue entries during wb_items_per_site drop incident" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587262 (owner: 10Addshore) [13:56:26] huh, what went wrong? 
[13:57:04] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: REVERT T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (1h) (duration: 00m 58s) [13:57:05] * addshore goes and re finds the log [13:57:09] only made it to canaries [13:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:21] Error from line 3467 of /srv/mediawiki/wmf-config/CommonSettings.php: Function name must be a string [13:57:47] (03PS1) 10Addshore: RejectParserCacheValue entries during wb_items_per_site drop incident 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587264 [13:58:18] Ah, call_user_func?! [13:58:22] yup [13:58:48] (03PS1) 10Ayounsi: Logstash: parse Juniper PFE firewall syslog [puppet] - 10https://gerrit.wikimedia.org/r/587265 (https://phabricator.wikimedia.org/T244147) [13:59:43] hoo: not actually sure thogh, $wgWBClientSettings['excludeNamespaces']() works for me [14:00:32] Maybe it's not set in some weird condition (or already resolved… is Wikibase storing the resolved value back into the array?) [14:00:34] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020): Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10elukey) >>! In T248905#6035721, @Aklapper wrote: > Thanks everyone! > > * CLI: Works; done. > ** Running `ssh stat1007` and entering the comman... 
[14:00:43] hmmmmmmmmmmm [14:00:59] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::data_check: set deploy-mode client [puppet] - 10https://gerrit.wikimedia.org/r/587253 (https://phabricator.wikimedia.org/T240230) (owner: 10Elukey) [14:01:29] hoo: I might just remove and continue without the namespace check [14:01:49] I don't think we will cause too many unneeded reparses in the grand scheme of things [14:01:56] Probably [14:03:12] (03CR) 10jerkins-bot: [V: 04-1] Logstash: parse Juniper PFE firewall syslog [puppet] - 10https://gerrit.wikimedia.org/r/587265 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [14:03:14] (03PS2) 10Addshore: RejectParserCacheValue entries during wb_items_per_site drop incident 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587264 (https://phabricator.wikimedia.org/T249565) [14:03:15] hoo: ^^ please review :) [14:03:30] (03PS2) 10Hnowlan: changeprop: use correct tag for null_edit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/587260 (https://phabricator.wikimedia.org/T248677) [14:04:38] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Maps (Tilerator), 10Patch-For-Review: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) Let's leave it open for now. We may have some dedicated maps maintenance capacity in the next couple quar...
[14:04:51] (03CR) 10Hoo man: [C: 03+1] RejectParserCacheValue entries during wb_items_per_site drop incident 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587264 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:05:20] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22389 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:05:24] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) [14:05:43] (03CR) 10Addshore: [C: 03+2] RejectParserCacheValue entries during wb_items_per_site drop incident 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587264 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:05:50] lets try that again [14:05:59] Instead of the window, we could have gone with purging in 10% of the cases or something like that (or even automatically upscaling that with progressing time)… I think I'm again way over-engineering this [14:06:11] :D [14:06:25] oooh automaticaly advancing the window would be cool [14:06:39] (03Merged) 10jenkins-bot: RejectParserCacheValue entries during wb_items_per_site drop incident 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587264 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:06:44] (03PS2) 10Ayounsi: Logstash: parse Juniper PFE firewall syslog [puppet] - 10https://gerrit.wikimedia.org/r/587265 (https://phabricator.wikimedia.org/T244147) [14:06:46] (03CR) 10Jbond: [C: 03+2] profile::tlsprox::envoy: update request_timeout parameter [puppet] - 10https://gerrit.wikimedia.org/r/585517 (owner: 10Jbond) [14:07:41] hoo: I wonder why i didnt spot that error when testing on mwdebug1002 too...... 
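Editorial sketch of the alternative floated at 14:05:59 (rejecting a sampled share of affected entries, starting around 10% and scaling up over time). This is not what was deployed, and nothing here comes from mediawiki-config: the function names, the linear ramp, and the `ramp_hours` parameter are all assumptions made for illustration.

```python
import random
from typing import Callable


def rejection_probability(elapsed_hours: float, ramp_hours: float, floor: float = 0.10) -> float:
    """Share of affected entries to reject after elapsed_hours.

    Hypothetical policy: start at `floor` (10%, the figure mentioned in the chat)
    and grow linearly to 100% once `ramp_hours` have passed.
    """
    if ramp_hours <= 0:
        raise ValueError("ramp_hours must be positive")
    progress = min(1.0, max(0.0, elapsed_hours / ramp_hours))
    return floor + (1.0 - floor) * progress


def should_reject_sampled(
    in_incident_window: bool,
    elapsed_hours: float,
    ramp_hours: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Reject an affected cache entry with the current ramped probability.

    `rng` is injectable so the decision can be made deterministic in tests.
    """
    if not in_incident_window:
        return False
    return rng() < rejection_probability(elapsed_hours, ramp_hours)
```

Compared with widening a time window by hand, a ramp like this would spread reparses evenly across affected pages without a deploy per step, at the cost of not knowing exactly which entries have been refreshed at any given moment.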
[14:07:42] heh [14:07:58] addshore I think the reason the namespace check failed was an extra () at the end [14:08:03] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (1h) take 2 (duration: 00m 57s) [14:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:10] T249595: Purge / Reject client pages that were cached in parser cache during the T249565 (wb_items_per_site) incident - https://phabricator.wikimedia.org/T249595 [14:08:10] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [14:08:18] (03PS1) 10Andrew Bogott: Neutron/rocky: add l3_agent_hacks that include the dmz_cidr change [puppet] - 10https://gerrit.wikimedia.org/r/587266 [14:09:02] (03PS1) 10DannyS712: RejectParserCacheValue entries during wb_items_per_site drop incident: namespace check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587267 (https://phabricator.wikimedia.org/T249565) [14:09:05] (03CR) 10jerkins-bot: [V: 04-1] Neutron/rocky: add l3_agent_hacks that include the dmz_cidr change [puppet] - 10https://gerrit.wikimedia.org/r/587266 (owner: 10Andrew Bogott) [14:09:15] DannyS712: see Wikibase.php in mediawiki-config, it is a callable function defined there. 
but there might be some other magic somewhere else that changes that fact [14:10:37] (03PS2) 10Andrew Bogott: Neutron/rocky: add l3_agent_hacks that include the dmz_cidr change [puppet] - 10https://gerrit.wikimedia.org/r/587266 (https://phabricator.wikimedia.org/T247505) [14:10:55] Good good, seeing the rejections come though at a sensible rate https://usercontent.irccloud-cdn.com/file/4dpBcEmr/image.png [14:11:20] (03Abandoned) 10DannyS712: RejectParserCacheValue entries during wb_items_per_site drop incident: namespace check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587267 (https://phabricator.wikimedia.org/T249565) (owner: 10DannyS712) [14:12:32] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:12:45] (03CR) 10Andrew Bogott: [C: 03+1] "any reason for me not to merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/581105 (owner: 10Alex Monk) [14:12:53] (03PS1) 10Addshore: RejectParserCache entries for wb_items_per_site 2/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587268 (https://phabricator.wikimedia.org/T249565) [14:13:09] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10CDanis) @MusikAnimal Yeah, you shouldn't expect to see any request data in your tcpdumps -- it'll all be TLS-encrypted. B... 
[14:13:29] (03CR) 10Addshore: [C: 03+2] RejectParserCache entries for wb_items_per_site 2/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587268 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:14:23] (03Merged) 10jenkins-bot: RejectParserCache entries for wb_items_per_site 2/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587268 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:15:09] (03PS1) 10Elukey: Add support for deploy-mode client in spark refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/587269 [14:15:11] (03CR) 10Ppchelko: [C: 04-1] "Forgot a little bit" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/587260 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [14:15:52] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (2/14.5h) (duration: 00m 58s) [14:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:59] T249595: Purge / Reject client pages that were cached in parser cache during the T249565 (wb_items_per_site) incident - https://phabricator.wikimedia.org/T249595 [14:15:59] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [14:21:28] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22398 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:22:16] (03CR) 10Elukey: [C: 03+2] Add support for deploy-mode client in spark refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/587269 (owner: 10Elukey) [14:22:36] (03PS1) 10Addshore: RejectParserCache entries for wb_items_per_site 4/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587271 
(https://phabricator.wikimedia.org/T249565) [14:22:51] (03CR) 10Addshore: [C: 03+2] RejectParserCache entries for wb_items_per_site 4/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587271 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:24:08] (03Merged) 10jenkins-bot: RejectParserCache entries for wb_items_per_site 4/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587271 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:24:52] (03PS2) 10Elukey: Add support for deploy-mode client in spark refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/587269 [14:25:32] 10Operations, 10Core Platform Team, 10Parsing-Team, 10Performance-Team, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10cscott) [14:25:34] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (4/14.5h) (duration: 00m 58s) [14:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:40] T249595: Purge / Reject client pages that were cached in parser cache during the T249565 (wb_items_per_site) incident - https://phabricator.wikimedia.org/T249595 [14:25:41] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [14:26:21] (03PS3) 10Hnowlan: changeprop: use correct tag for null_edit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/587260 (https://phabricator.wikimedia.org/T248677) [14:27:18] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:27:46] 10Operations, 10Core Platform Team, 10Parsing-Team, 10Performance-Team, and 4 others: Strategy for 
storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10cscott) Putting this on the (long-term!) radar of the parsing team. Since we are hoping... [14:29:45] (03CR) 10Alex Monk: "Can't think of one" [puppet] - 10https://gerrit.wikimedia.org/r/581105 (owner: 10Alex Monk) [14:30:00] (03PS1) 10Addshore: RejectParserCache entries for wb_items_per_site 8/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587273 (https://phabricator.wikimedia.org/T249565) [14:30:09] (03PS3) 10Elukey: Add support for deploy-mode client in spark refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/587269 [14:30:58] (03CR) 10Addshore: [C: 03+2] RejectParserCache entries for wb_items_per_site 8/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587273 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:31:42] (03Merged) 10jenkins-bot: RejectParserCache entries for wb_items_per_site 8/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587273 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:32:19] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21740/" [puppet] - 10https://gerrit.wikimedia.org/r/587269 (owner: 10Elukey) [14:34:28] (03CR) 10Elukey: [C: 03+2] Add support for deploy-mode client in spark refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/587269 (owner: 10Elukey) [14:35:02] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (8/14.5h) (duration: 00m 58s) [14:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:10] T249595: Purge / Reject client pages that were cached in parser cache during the T249565 (wb_items_per_site) incident - https://phabricator.wikimedia.org/T249595 [14:35:10] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on 
page views - https://phabricator.wikimedia.org/T249565 [14:36:47] (03CR) 10Ppchelko: [C: 03+2] changeprop: use correct tag for null_edit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/587260 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [14:37:05] (03Merged) 10jenkins-bot: changeprop: use correct tag for null_edit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/587260 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [14:37:11] addshore: Application server req/s are growing, let's let them catch up for a bit? [14:37:50] hoo: yup, I'm staring at https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-15m&to=now&refresh=10s [14:37:58] and https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-3h&to=now [14:38:08] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22400 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:38:09] I'm looking at the same two things [14:38:21] !log cp3052: stop vhtcpd, start purged T249583 [14:38:22] :) [14:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:26] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [14:39:02] hoo: will you be around for the next little while? [14:39:28] Probably like 30 more minutes [14:39:59] aah okay! [14:40:11] * addshore goes to find another wmde one!
[14:46:40] I'll probably leave it at this rate for a bit [14:50:44] actually the app servers look fine, the increase in response is just due to the reparses but not an indication of anything being overloaded [14:52:14] (03PS1) 10Addshore: RejectParserCache entries for wb_items_per_site 10/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587277 (https://phabricator.wikimedia.org/T249565) [14:52:24] !log cloudvirt2003-dev: downtime in icinga and reboot to enable BIOS virtualization support T249453 [14:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:30] T249453: Unable to create networking for new VMs in codfw1-dev - https://phabricator.wikimedia.org/T249453 [14:52:36] (03CR) 10Addshore: [C: 03+2] RejectParserCache entries for wb_items_per_site 10/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587277 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:53:31] (03Merged) 10jenkins-bot: RejectParserCache entries for wb_items_per_site 10/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587277 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [14:53:44] PROBLEM - Varnish HTCP daemon on cp3052 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish [14:56:27] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (10/14.5h) (duration: 00m 55s) [14:56:28] (03PS7) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [14:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:34] T249595: Purge / Reject client pages that were cached in parser cache during the T249565 (wb_items_per_site) incident - https://phabricator.wikimedia.org/T249595 [14:56:34] T249565: 
Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [14:59:02] appservers load seems fine, average response time is obviously elevated, I'll wait a little bit more for the next bump [15:00:10] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [15:00:18] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:50] 10Operations, 10Traffic: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 (10ema) I am testing a first iteration of `purged` (T249583) on cp3052. The program sends PURGEs over multiple TCP connections, and ats-be is now doing much better:... 
[15:10:04] RECOVERY - Varnish HTCP daemon on cp3052 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish [15:10:37] !log cp3052: stop purged, start vhtcpd T249583 T241232 [15:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:44] T241232: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 [15:10:44] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [15:14:02] (03PS1) 10Addshore: RejectParserCache entries for wb_items_per_site 12/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587280 (https://phabricator.wikimedia.org/T249565) [15:15:15] (03CR) 10Addshore: [C: 03+2] RejectParserCache entries for wb_items_per_site 12/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587280 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [15:15:26] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.357e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:16:08] (03Merged) 10jenkins-bot: RejectParserCache entries for wb_items_per_site 12/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587280 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [15:17:52] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (12/14.5h) (duration: 01m 00s) [15:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:59] T249595: Purge / Reject client pages that were cached in parser cache during the T249565 (wb_items_per_site) incident - 
https://phabricator.wikimedia.org/T249595 [15:18:00] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [15:19:06] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:20:07] !log enable uRPF loose mode (log only) on cr4-ulsfo - T244147 [15:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:16] !log installing idp-test2001 [15:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:36] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:27:13] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020): Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10Nuria) @Aklapper let's see: - @srishakatux work was not done in superset so superset does not have access to her data (superset is for internal... 
[15:27:16] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22401 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:28:59] 10Operations: Onboarding Janis Meybohm - https://phabricator.wikimedia.org/T249081 (10Dsharpe) [15:30:20] 10Operations: Onboarding Janis Meybohm - https://phabricator.wikimedia.org/T249081 (10Dsharpe) [15:34:24] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10Nuria) If @Tchanders: tchanders, @dmaza: dmaza, @dbarratt: dbarratt, @Mooeypoo need access to... [15:34:52] (03PS1) 10Ayounsi: Add uRPF loose mode [homer/public] - 10https://gerrit.wikimedia.org/r/587281 (https://phabricator.wikimedia.org/T244147) [15:35:51] (03PS2) 10Andrew Bogott: cloud eqiad1: Remove references to old cloud-puppetmaster stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/581105 (owner: 10Alex Monk) [15:36:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Neutron/rocky: add l3_agent_hacks that include the dmz_cidr change [puppet] - 10https://gerrit.wikimedia.org/r/587266 (https://phabricator.wikimedia.org/T247505) (owner: 10Andrew Bogott) [15:37:18] (03CR) 10Andrew Bogott: [C: 03+2] cloud eqiad1: Remove references to old cloud-puppetmaster stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/581105 (owner: 10Alex Monk) [15:39:48] (03PS1) 10Hnowlan: Changeprop: Use HTTPS eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/587282 (https://phabricator.wikimedia.org/T248677) [15:39:54] PROBLEM - PHP opcache health on mw2364 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:39:57] (03PS3) 10Andrew Bogott: Neutron/rocky: add l3_agent_hacks that include the dmz_cidr 
change [puppet] - 10https://gerrit.wikimedia.org/r/587266 (https://phabricator.wikimedia.org/T247505) [15:39:59] (03PS1) 10Andrew Bogott: codfw1dev: turn the dmz_cidr_hack back on [puppet] - 10https://gerrit.wikimedia.org/r/587283 (https://phabricator.wikimedia.org/T247505) [15:42:16] (03CR) 10Andrew Bogott: [C: 03+2] Neutron/rocky: add l3_agent_hacks that include the dmz_cidr change [puppet] - 10https://gerrit.wikimedia.org/r/587266 (https://phabricator.wikimedia.org/T247505) (owner: 10Andrew Bogott) [15:43:08] (03CR) 10Ppchelko: "Do we still need to allow access to 32192 in calico?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/587282 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:44:25] (03PS2) 10Hnowlan: Changeprop: Use HTTPS eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/587282 (https://phabricator.wikimedia.org/T248677) [15:45:23] 10Operations, 10Traffic: Implement TTL cap for ats-be - https://phabricator.wikimedia.org/T249627 (10ema) [15:45:37] 10Operations, 10Traffic: Implement TTL cap for ats-be - https://phabricator.wikimedia.org/T249627 (10ema) p:05Triage→03Medium [15:46:59] (03CR) 10Aaron Schulz: [C: 04-1] "HtmlCacheUpdateJob directly uses CdnCacheUpdate::purge(), which bypasses rebound purges. 
$wgCdnReboundPurgeDelay is only meant for purges " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586390 (https://phabricator.wikimedia.org/T249325) (owner: 10CDanis) [15:50:24] (03CR) 10Ppchelko: [C: 03+2] Changeprop: Use HTTPS eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/587282 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:50:39] (03Merged) 10jenkins-bot: Changeprop: Use HTTPS eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/587282 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:53:34] (03PS8) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [15:53:48] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:52] 10Operations, 10serviceops, 10Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 (10cscott) @joe could you take a look at https://gerrit.wikimedia.org/r/579021 and subsequent patches as we... 
[15:55:24] 10Operations, 10User-jbond: Create a staging environment for CAS - https://phabricator.wikimedia.org/T233930 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [15:55:49] 10Operations, 10vm-requests: codfw: 1 VM request for idp staging host - https://phabricator.wikimedia.org/T249594 (10MoritzMuehlenhoff) 05Open→03Resolved idp-test2001.wikimedia.org has been created, rest of the setup is handled via T233930 [15:58:24] (03PS1) 10Krinkle: mediawiki/maintenance: add startupregistrystats for test.wp.o and mw.o [puppet] - 10https://gerrit.wikimedia.org/r/587286 (https://phabricator.wikimedia.org/T233678) [15:59:02] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: turn the dmz_cidr_hack back on [puppet] - 10https://gerrit.wikimedia.org/r/587283 (https://phabricator.wikimedia.org/T247505) (owner: 10Andrew Bogott) [16:00:04] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200407T1600). [16:00:04] Krinkle: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:01:32] (03Abandoned) 10CRusnov: Update Netbox to v2.7.8 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/575580 (owner: 10CRusnov) [16:02:07] (03CR) 10CRusnov: [C: 03+2] reports/accounting: fix a few docstring issues [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584042 (owner: 10Faidon Liambotis) [16:03:05] (03CR) 10CRusnov: [C: 03+2] reports/accounting: bump Python minimum to 3.7 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584041 (owner: 10Faidon Liambotis) [16:04:44] Krinkle: looking at your patches [16:04:49] (03PS9) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [16:04:56] godog: okay :) [16:05:44] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10JHedden) I think this is the full conversation stream between xtools and de.wikipedia.org (after URI: http://xtools.wmflab... [16:07:18] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki/maintenance: add startupregistrystats for test.wp.o and mw.o [puppet] - 10https://gerrit.wikimedia.org/r/587286 (https://phabricator.wikimedia.org/T233678) (owner: 10Krinkle) [16:10:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/585968 looks good to me, _joe_ I'm assuming that is safe to deploy and puppet-merge is enough ? [16:11:46] <_joe_> godog: yes [16:12:04] thanks! 
merging [16:12:05] <_joe_> it should be, but ask vgutierrez regarding it I guess [16:12:07] (03CR) 10Filippo Giunchedi: [C: 03+2] apache: restore redirect from stats.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/585968 (https://phabricator.wikimedia.org/T126281) (owner: 10Krinkle) [16:12:20] (03PS2) 10Filippo Giunchedi: apache: restore redirect from stats.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/585968 (https://phabricator.wikimedia.org/T126281) (owner: 10Krinkle) [16:13:03] yeah it looks sane [16:13:11] don't get the NOOP change on nc_redirects.dat though [16:13:51] yeah fair, seems just cosmetic Krinkle [16:13:58] RECOVERY - PHP opcache health on mw2364 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:15:37] (03PS1) 10Krinkle: mediawiki: Document the apache sample hosts [puppet] - 10https://gerrit.wikimedia.org/r/587289 (https://phabricator.wikimedia.org/T244472) [16:15:49] vgutierrez: I was editing there first, forgot to undo :/ [16:16:01] np :) [16:16:03] although the whitespace is a bit inconsistent in that file. but should've been a separate commit [16:16:38] !log restarting CI jenkins [16:16:41] FWIW I'm ok to merge as-is, as soon as CI completes [16:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:55] godog: cool, let me know when I should test/verify. [16:17:07] _joe_: doc change if you like - https://gerrit.wikimedia.org/r/587289 [16:17:53] (03PS3) 10CRusnov: netbox (hiera): Add coherence.Rack to alerted reports [puppet] - 10https://gerrit.wikimedia.org/r/578551 (https://phabricator.wikimedia.org/T239244) [16:18:17] Krinkle: should be rolled out in the next 30 min now, both patches merged [16:18:56] _joe_: regarding opcache - did you want to test in codfw or a depooled eqiad? thinking might be easier in eqiad for apples-to-apples comparison, but I guess we could also compare another depooled codfw host.
let me know :) writing the puppet patch now. [16:19:16] <_joe_> Krinkle: I'm off-ish but I think it's easier in eqiad, yes [16:19:42] okay, np [16:20:21] (03CR) 10Herron: "Yes indeed! One question -- will relforge kibana need the phatality plugin installed? https://phabricator.wikimedia.org/phame/post/view/1" [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [16:21:50] (03PS1) 10Ssingh: aptrepo: add Postgres repo for cescout role [puppet] - 10https://gerrit.wikimedia.org/r/587290 (https://phabricator.wikimedia.org/T247273) [16:24:01] !log 1.35.0-wmf.27 was branched at e76ac29cd9c57bed4097ec8a4ea8311fb55fd967 for T247774 [16:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:08] T247774: 1.35.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T247774 [16:24:17] (03PS10) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [16:25:04] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:29:29] (03PS11) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [16:31:10] 10Operations, 10Repository-Admins, 10Traffic: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (10Dzahn) Please see https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests [16:32:09] (03CR) 10Mstyles: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [16:32:14] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 
22399 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:33:33] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10Dzahn) Hi @jpita Does the above sound good to you? [16:36:20] (03CR) 10Elukey: [C: 03+1] kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [16:37:17] (03PS12) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [16:38:55] (03CR) 10CRusnov: [V: 03+2 C: 03+2] netbox (hiera): Add coherence.Rack to alerted reports [puppet] - 10https://gerrit.wikimedia.org/r/578551 (https://phabricator.wikimedia.org/T239244) (owner: 10CRusnov) [16:40:01] (03PS1) 10Andrew Bogott: cloud-vps: add vm client packages for Rocky [puppet] - 10https://gerrit.wikimedia.org/r/587292 (https://phabricator.wikimedia.org/T248635) [16:40:38] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:41:07] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [16:41:24] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10MusikAnimal) >>! In T249035#6036900, @JHedden wrote: > Can you confirm that you're seeing this on both xtools-prod06 and x... 
[16:41:43] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: add vm client packages for Rocky [puppet] - 10https://gerrit.wikimedia.org/r/587292 (https://phabricator.wikimedia.org/T248635) (owner: 10Andrew Bogott) [16:41:59] 10Operations, 10netbox, 10Patch-For-Review: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10crusnov) This is complete, but leaving open for the subtasks which are issues related to this new check. [16:42:34] 10Operations, 10netbox, 10Patch-For-Review: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10crusnov) a:05crusnov→03RobH [16:44:27] (03Abandoned) 10CRusnov: Add more emacs things to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/507393 (owner: 10CRusnov) [16:45:18] (03PS1) 10Hnowlan: Changeprop: add puppet CA cert to environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) [16:47:48] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22390 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:49:44] (03PS13) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [16:53:51] (03CR) 10Ppchelko: "How exactly is this going to become an env variable?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [16:54:33] (03PS14) 10Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) [16:58:58] (03PS1) 10Krinkle: mediawiki: increase php7 opcache capacity on mw1407 [puppet] - 10https://gerrit.wikimedia.org/r/587299 [16:59:23] (03PS2) 10Krinkle: mediawiki: increase php7 opcache capacity on mw1407 [puppet] - 10https://gerrit.wikimedia.org/r/587299 (https://phabricator.wikimedia.org/T99740) [17:00:04] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200407T1700). [17:01:07] (03CR) 10Krinkle: "Marking as WIP as the server is not yet depooled." [puppet] - 10https://gerrit.wikimedia.org/r/587299 (https://phabricator.wikimedia.org/T99740) (owner: 10Krinkle) [17:01:13] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Krinkle) 05Open→03Resolved a:03Krinkle Confirmed via . It now... 
[17:01:15] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/21749/phab1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [17:01:20] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Krinkle) [17:05:06] (03CR) 10Dzahn: ""title": "/etc/envoy/listeners.d/00-tls_terminator_22280.yaml"," [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [17:06:06] (03CR) 1020after4: [C: 03+1] phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [17:07:13] (03CR) 1020after4: [C: 03+1] ATS/phabricator: directly talk wss:// to aphlict [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [17:09:55] (03CR) 10CDanis: "> Patch Set 2: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586390 (https://phabricator.wikimedia.org/T249325) (owner: 10CDanis) [17:17:21] (03CR) 10Elukey: "Had a chat with Herron, and he suggested to avoid phatality if not needed. The plugin is added by default in the kibana class:" [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [17:27:14] (03PS9) 10Dzahn: ATS/phabricator: directly talk wss:// to aphlict [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) [17:27:16] (03CR) 10Muehlenhoff: aptrepo: add Postgres repo for cescout role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587290 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [17:29:34] (03CR) 10Hnowlan: "> How exactly is this going to become an env variable?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [17:30:54] (03CR) 10BBlack: [C: 03+1] mgmt: use netbox-generated data for ulsfo [dns] - 10https://gerrit.wikimedia.org/r/585545 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:33:12] (03CR) 10CDanis: [C: 03+1] Add uRPF loose mode [homer/public] - 10https://gerrit.wikimedia.org/r/587281 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [17:33:16] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10Mooeypoo) @Nuria I'm a little confused, and I'd like to clarify something. I think the initia... [17:34:38] (03CR) 10BBlack: [C: 04-1] "There is no sensible default because everything about this is hardware-specific. The only sensible default is to not attempt to set the t" [puppet] - 10https://gerrit.wikimedia.org/r/563976 (owner: 10Filippo Giunchedi) [17:40:56] !log increasing eqiad.mediawiki.job.cirrusSearchElasticaWrite to 3 partitions T240702 [17:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:03] T240702: mediawiki.job.cirrusSearchElasticaWrite topics need more partitions! 
- https://phabricator.wikimedia.org/T240702 [17:44:17] (03PS2) 10Ssingh: aptrepo: add Postgres repo for cescout role [puppet] - 10https://gerrit.wikimedia.org/r/587290 (https://phabricator.wikimedia.org/T247273) [17:44:40] (03PS1) 10Huji: Restore the 'reviewer' gropu for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587301 (https://phabricator.wikimedia.org/T249643) [17:49:45] !log ppchelko@deploy1001 Started restart [cpjobqueue/deploy@83c93d1]: Try to make it notice new partitions T240702 [17:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:51] T240702: mediawiki.job.cirrusSearchElasticaWrite topics need more partitions! - https://phabricator.wikimedia.org/T240702 [17:49:53] jouncebot: now [17:49:53] For the next 0 hour(s) and 10 minute(s): Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200407T1700) [17:50:07] are mediawiki-config things happening or can I do one? [17:50:34] (03PS1) 10Addshore: RejectParserCache entries for wb_items_per_site 14.5/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587303 (https://phabricator.wikimedia.org/T249565) [17:51:09] (03PS2) 10Addshore: RejectParserCache entries for wb_items_per_site 14.5/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587303 (https://phabricator.wikimedia.org/T249565) [17:51:30] (03CR) 10Addshore: [C: 03+2] RejectParserCache entries for wb_items_per_site 14.5/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587303 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [17:52:26] (03Merged) 10jenkins-bot: RejectParserCache entries for wb_items_per_site 14.5/14.5 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587303 (https://phabricator.wikimedia.org/T249565) (owner: 10Addshore) [17:52:56] James_F: this is the last bit of the large bit [17:54:03] !log addshore@deploy1001 sync-file aborted: T249565 T249595 RejectParserCacheValue 
entries during wb_items_per_site drop incident (14.5/14.5h) (duration: 01m 16s) [17:54:03] (03CR) 10Ayounsi: [C: 03+2] Add uRPF loose mode [homer/public] - 10https://gerrit.wikimedia.org/r/587281 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [17:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:09] T249595: Purge / Reject client pages that were cached in parser cache during the T249565 (wb_items_per_site) incident - https://phabricator.wikimedia.org/T249595 [17:54:09] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [17:54:11] !log last sync stuck on sync-masters [17:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:43] much better this time [17:55:21] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (14.5/14.5h) retry (duration: 01m 02s) [17:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:35] !log increasing codfw.mediawiki.job.cirrusSearchElasticaWrite to 3 partitions T240702 [17:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:40] T240702: mediawiki.job.cirrusSearchElasticaWrite topics need more partitions! - https://phabricator.wikimedia.org/T240702 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200407T1800) [18:24:30] 10Operations, 10netbox: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10RobH) 05Open→03Resolved a:05RobH→03None resolving as individual tasks for corrections are to be made by each onsite for their sites. thanks for setting this up! 
[18:24:46] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (10Bugreporter) [18:25:35] 10Operations, 10Performance-Team: Occasional NIC Tx bandwidth saturation for mc1027 - https://phabricator.wikimedia.org/T248962 (10aaron) I wonder if it would be useful for the template name to appear in the key when possible. Right now it's just an opaque hash. I doubt that many invocations of different templ... [18:28:26] 10Operations, 10DNS, 10Traffic: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (10Bugreporter) [18:30:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/587290 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [18:35:28] (03CR) 10Ssingh: [C: 03+2] aptrepo: add Postgres repo for cescout role [puppet] - 10https://gerrit.wikimedia.org/r/587290 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [18:42:45] (03CR) 10Papaul: [C: 04-1] backups: Assume backups have their ssds on sda and sdb for partman (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587214 (https://phabricator.wikimedia.org/T248934) (owner: 10Jcrespo) [18:48:54] !log jhuneidi@deploy1001 Pruned MediaWiki: 1.35.0-wmf.24 (duration: 12m 44s) [18:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:42] addshore: `Error (/srv/mediawiki/wmf-config/CommonSettings.php:3467) PHP Notice: Undefined variable: wgWBClientSettings` [18:54:10] addshore: 'Cos it's not in a if( $wmgUseWikibaseClient ) check [18:55:02] hmmm?? [18:55:13] Are you seeing that now? [18:55:38] Yes, top prod error. [18:55:48] I guess from wikitech? [18:55:53] whut........ [18:55:58] Everything else is a Wikibase client now? [18:56:07] Sorry, am in meeting. [18:56:11] Not a crisis, just irritating. 
[18:56:30] wgWBClientSettings is no longer used in the hook that I added today [18:56:40] Line 3467 [18:56:42] line 3467 currently deployed should be [18:56:42] return true; [18:57:11] I'll resync and see if it is 1 server or perhaps even just at the top from a bad sync from earlier [18:57:23] Oh, might be from earlier. [18:57:24] Sorry. [18:57:50] np :) yes probably from something earlier that got caught by the canaries, slightly different error but pretty sure it is the same cause [18:58:06] Cool [18:58:51] But line 3487 has an un-guarded use of wgWBClientSettings. [18:59:00] Should be fine, though. [19:00:04] longma and James_F: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200407T1900). [19:00:41] train is blocked still so no deployment at this time [19:01:12] * James_F nods. [19:01:44] longma: Is it just blocked by the Wikidata stuff? That's all clean-up now, I believe? [19:02:36] oh, let me confirm [19:03:10] James_F: It seems like this is a blocker: https://phabricator.wikimedia.org/T249565 [19:03:28] longma: Refresh the page. :-) [19:04:02] hah, I did see another comment saying to hold off unblocking it [19:04:43] 10Operations, 10CommRel-Specialists-Support, 10Core Platform Team, 10Editing-team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [19:05:04] I'll start deploying then [19:05:23] 10Operations, 10CommRel-Specialists-Support, 10Core Platform Team, 10Editing-team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) Roadmap alignment and any stewardship needs from CPT confirmed by Cindy.
[19:06:25] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.35.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587312 [19:06:27] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.35.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587312 (owner: 10Jeena Huneidi) [19:06:48] whooo [19:07:21] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587312 (owner: 10Jeena Huneidi) [19:07:24] sorry, I didnt know if I should unlink the task or not or just state it is not a blocker any more! [19:07:33] addshore: No worries. [19:08:30] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.27 [19:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:01] !log push pfw firewall rules - T249650 [19:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:51] (03PS2) 10RLazarus: maintenance: Migrate updatetranslationstats to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585795 (https://phabricator.wikimedia.org/T211250) [19:16:53] (03PS2) 10RLazarus: maintenance: Migrate echo_mail_batch to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585796 (https://phabricator.wikimedia.org/T211250) [19:17:16] (03CR) 10jerkins-bot: [V: 04-1] maintenance: Migrate updatetranslationstats to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585795 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [19:17:31] (03CR) 10jerkins-bot: [V: 04-1] maintenance: Migrate echo_mail_batch to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585796 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [19:22:05] (03PS3) 10RLazarus: maintenance: Migrate updatetranslationstats to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585795 (https://phabricator.wikimedia.org/T211250) [19:22:07] (03PS3) 10RLazarus: maintenance: Migrate echo_mail_batch to periodic_job [puppet] - 
10https://gerrit.wikimedia.org/r/585796 (https://phabricator.wikimedia.org/T211250) [19:28:55] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp5006 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Varnish [19:31:30] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate updatetranslationstats to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585795 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [19:32:24] 10Operations, 10DNS, 10Traffic: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (10Aklapper) 05Open→03Stalled @Bugreporter: Redirect what to what? Please be clearer and do follow https://www.mediawiki.org/wiki/How_to_report_a_bug . Clicking your first link goes to incubator, for e... [19:32:45] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.505 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:36:32] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate echo_mail_batch to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585796 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [19:41:07] 10Operations, 10DNS, 10Traffic: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (10Bugreporter) [19:41:10] 10Operations, 10DNS, 10Traffic: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (10Bugreporter) 05Stalled→03Open [19:45:21] !log Temporary modified dumpsgen's crontab on snapshot1008 so that the Wikidata RDF dumps start now (broke as a side effect of T249565) [19:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:27] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [19:47:30] hoo: If that drops the table again I'll be sad! 
;-( [19:49:26] Indeed… but I think we fixed enough parts in the fatal chain to be safe for now [19:50:00] * James_F crosses fingers and toes. [19:50:19] it uses something else instead of sql.php [19:50:38] mysql.php (which fetches the credentials and then uses the mysql binary) [19:50:50] which is what mwscript sql uses [19:51:22] I mean plain sql on the deployment hosts [19:55:21] (03PS1) 10Hoo man: wikibasedumps-shared: Fix mysql.php path [puppet] - 10https://gerrit.wikimedia.org/r/587322 [19:57:45] Right. [20:02:26] (03CR) 10ArielGlenn: [C: 03+2] wikibasedumps-shared: Fix mysql.php path [puppet] - 10https://gerrit.wikimedia.org/r/587322 (owner: 10Hoo man) [20:08:58] !log (Take 2) Temporary modified dumpsgen's crontab on snapshot1008 so that the Wikidata RDF dumps start now (broke as a side effect of T249565) [20:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:04] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [20:09:04] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.27 (duration: 60m 34s) [20:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:41] (03PS1) 10RLazarus: maintenance: Migrate parsercachepurging to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587324 (https://phabricator.wikimedia.org/T211250) [20:13:45] (03PS1) 10Jeena Huneidi: group0 wikis to 1.35.0-wmf.27 refs T247774 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587326 [20:13:47] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.35.0-wmf.27 refs T247774 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587326 (owner: 10Jeena Huneidi) [20:14:17] (03CR) 10jerkins-bot: [V: 04-1] maintenance: Migrate parsercachepurging to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587324 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [20:14:43] (03PS1) 
10RLazarus: maintenance: migrate cleanup_upload_stash to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587327 (https://phabricator.wikimedia.org/T211250) [20:14:56] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.27 refs T247774 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587326 (owner: 10Jeena Huneidi) [20:16:00] (03PS2) 10RLazarus: maintenance: Migrate parsercachepurging to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587324 (https://phabricator.wikimedia.org/T211250) [20:17:26] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.27 refs T247774 [20:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:32] T247774: 1.35.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T247774 [20:19:38] longma: Looks OK to me so far. [20:20:30] cool, I think so too [20:26:29] (03PS1) 10RLazarus: maintenance: Migrate update_flaggedrev_stats to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587328 (https://phabricator.wikimedia.org/T211250) [20:26:49] 10Operations, 10Parsing-Team, 10Performance-Team, 10TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10CCicalese_WMF) [20:28:46] 10Operations, 10Parsing-Team, 10Performance-Team, 10TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Anomie) If this gets to the point where there's a plan for the system to identify revisions that nee... [20:30:38] (03PS1) 10Hoo man: wikibasedumps-shared: Fix mysql.php group param [puppet] - 10https://gerrit.wikimedia.org/r/587329 [20:31:17] RhinosF1: we're running into some issues with packaging (again). First times are always more trouble. Hopefully we'll sort out everything tomorrow, same time. [20:31:24] RhinosF1: sorry for the delay! 
[20:31:44] No problem [20:32:07] (03CR) 10ArielGlenn: [C: 03+2] wikibasedumps-shared: Fix mysql.php group param [puppet] - 10https://gerrit.wikimedia.org/r/587329 (owner: 10Hoo man) [20:34:22] !log (Take 3) Temporary modified dumpsgen's crontab on snapshot1008 so that the Wikidata RDF dumps start now (broke as a side effect of T249565) [20:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:29] T249565: Wikidata's wb_items_per_site table has suddenly disappeared, creating DBQueryErrors on page views - https://phabricator.wikimedia.org/T249565 [20:37:32] !log briefly downtiming serpens and seaborgium. I'm trying to investigate a possible split-brain so going to turn ldap off on one, and then the other, to see if behavior changes [20:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:17] (03PS1) 10RLazarus: maintenance: Migrate refreshlinks to period_job [puppet] - 10https://gerrit.wikimedia.org/r/587331 (https://phabricator.wikimedia.org/T211250) [20:39:14] !log correction: briefly downtiming ldap-eqiad-replica0 and ldap-eqiad-replica1. 
I'm trying to investigate a possible split-brain so going to turn ldap off on one, and then the other, to see if behavior changes [20:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:11] (03PS2) 10RLazarus: maintenance: Migrate refreshlinks to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587331 (https://phabricator.wikimedia.org/T211250) [20:53:27] PROBLEM - PHP opcache health on mw2360 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:54:09] (03PS1) 10RLazarus: maintenance: Migrate update_special_pages to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587334 (https://phabricator.wikimedia.org/T211250) [21:07:57] RECOVERY - PHP opcache health on mw2360 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:11:20] 10Operations, 10observability, 10serviceops: write some recording rules for queries used in the appserver RED dashboard - https://phabricator.wikimedia.org/T249663 (10CDanis) [21:11:45] (03CR) 10Bstorm: [C: 03+2] tools-static: apply SNI name setting to fontcdn as well [puppet] - 10https://gerrit.wikimedia.org/r/586475 (https://phabricator.wikimedia.org/T249558) (owner: 10Bstorm) [21:31:55] PROBLEM - Ensure local MW versions match expected deployment on wtp1025 is CRITICAL: CRITICAL: 130 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:35:10] ^ expected, the host is still depooled to work on T249535 [21:35:11] T249535: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 [21:49:03] (03CR) 10Bstorm: [C: 03+2] "Tested from a local checkout in toolsbeta, and this will fix issue #2 in that ticket! 
It does not fix issue #1, so I might fiddle with tha" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/585846 (https://phabricator.wikimedia.org/T249390) (owner: 10BryanDavis) [21:49:52] (03Merged) 10jenkins-bot: Fix partial rename of "type" parameter to "wstype" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/585846 (https://phabricator.wikimedia.org/T249390) (owner: 10BryanDavis) [21:51:47] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:54:49] 10Operations, 10DNS, 10Traffic: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (10Reedy) [22:00:53] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:05:19] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:22] (03PS1) 10RobH: updating sku list [software] - 10https://gerrit.wikimedia.org/r/587362 [22:23:08] (03CR) 10RobH: [C: 03+2] updating sku list [software] - 10https://gerrit.wikimedia.org/r/587362 (owner: 10RobH) [22:24:08] (03CR) 10Aaron Schulz: [C: 04-1] "Mostly just the code. 
I rebased https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/528924/ , since it was meant to clean up this area of " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586390 (https://phabricator.wikimedia.org/T249325) (owner: 10CDanis) [22:29:12] 10Operations, 10Performance-Team: Occasional NIC Tx bandwidth saturation for mc1027 - https://phabricator.wikimedia.org/T248962 (10aaron) Hmm, it would help if cacheGetTree() /cacheSetTree() were replaced by getWithSetCallback() perhaps. Lots of optimizations are not used atm due to that fact. [22:53:57] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10Papaul) [22:59:55] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (10Papaul) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200407T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:01:33] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:51] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Papaul) [23:02:29] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10Papaul) [23:02:58] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (10Papaul) [23:03:26] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (10Papaul) [23:14:09] PROBLEM - PHP opcache health on mw2356 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:14:36] (03CR) 10Hashar: "The CI agents on WMCS are all on Stretch and thus Docker sticks to 18.06 (due to T236675)." 
[puppet] - 10https://gerrit.wikimedia.org/r/586203 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [23:15:01] (03PS2) 10Mstyles: kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) [23:16:26] (03PS1) 10Bstorm: args: A few fixups [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) [23:19:12] (03PS1) 10Cwhite: raid: add lsscsi to required packages for hpsa raid [puppet] - 10https://gerrit.wikimedia.org/r/587370 (https://phabricator.wikimedia.org/T199236) [23:19:14] (03CR) 10jerkins-bot: [V: 04-1] kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [23:22:14] (03CR) 10jerkins-bot: [V: 04-1] raid: add lsscsi to required packages for hpsa raid [puppet] - 10https://gerrit.wikimedia.org/r/587370 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [23:23:18] (03PS2) 10Cwhite: raid: add lsscsi to required packages for hpsa raid [puppet] - 10https://gerrit.wikimedia.org/r/587370 (https://phabricator.wikimedia.org/T199236) [23:24:04] (03PS1) 10Papaul: DNS: Remove mgmt DNS for cp200[1-2,4-8,10-14] [dns] - 10https://gerrit.wikimedia.org/r/587371 [23:24:24] (03PS3) 10Mstyles: kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) [23:32:17] RECOVERY - PHP opcache health on mw2356 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:35:54] (03PS1) 10Bstorm: toolforge: ensure the python2 backport of configparser is installed [puppet] - 10https://gerrit.wikimedia.org/r/587372 (https://phabricator.wikimedia.org/T249390) [23:36:45] (03PS4) 10Mstyles: kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) [23:36:55] (03CR) 
10Bstorm: "Related to Id99087327c896958a" [puppet] - 10https://gerrit.wikimedia.org/r/587372 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [23:37:21] (03CR) 10BryanDavis: args: A few fixups (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [23:38:25] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for cp200[1-2,4-8,10-14] [dns] - 10https://gerrit.wikimedia.org/r/587371 (owner: 10Papaul) [23:41:22] (03CR) 10Bstorm: args: A few fixups (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [23:43:14] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (10Papaul) [23:43:45] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (10Papaul) 05Open→03Resolved Complete [23:43:57] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (10Papaul) [23:44:06] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (10Papaul) 05Open→03Resolved Complete [23:44:22] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10Papaul) [23:44:32] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10Papaul) 05Open→03Resolved complete [23:44:50] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10Papaul) [23:45:03] 
10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10Papaul) 05Open→03Resolved Complete [23:45:18] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10Papaul) [23:45:27] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10Papaul) Complete [23:45:49] (03CR) 10Mstyles: "puppet compiler: https://puppet-compiler.wmflabs.org/compiler1003/21760/" [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [23:46:17] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10Papaul) [23:46:30] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10Papaul) 05Open→03Resolved Complete [23:46:46] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Papaul) [23:47:01] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Papaul) 05Open→03Resolved Complete [23:47:37] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10Papaul) [23:47:44] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10Papaul) Complete [23:47:55] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (10Papaul) [23:48:05] 10Operations, 10ops-codfw, 
10Traffic, 10decommission, 10Patch-For-Review: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (10Papaul) 05Open→03Resolved Complete [23:49:01] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (10Papaul) [23:49:20] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (10Papaul) 05Open→03Resolved Complete [23:50:21] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (10Papaul) [23:50:48] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (10Papaul) 05Open→03Resolved Complete [23:51:25] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10Papaul) [23:51:41] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10Papaul) 05Open→03Resolved Complete [23:52:09] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10Papaul) 05Open→03Resolved [23:52:40] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10Papaul) 05Open→03Resolved [23:55:31] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (10Papaul) [23:55:51] 10Operations, 10ops-codfw, 10decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (10Papaul) [23:56:52] (03CR) 10Bstorm: 
args: A few fixups (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm)