[00:00:04] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T0000).
[00:03:27] (CR) Cwhite: [C: +1] puppet_compiler: Add checks for missing facts files [software/puppet-compiler] - https://gerrit.wikimedia.org/r/523709 (https://phabricator.wikimedia.org/T228266) (owner: Jbond)
[01:42:57] (PS1) Ayounsi: Anycast, make recdns VIP alerts page [puppet] - https://gerrit.wikimedia.org/r/524067
[01:45:02] (CR) Ayounsi: "Feel free to abandon it if you think it's not worth it." [puppet] - https://gerrit.wikimedia.org/r/524067 (owner: Ayounsi)
[01:45:51] (PS2) Tim Starling: mediawiki: Fix undefined 'err' and 'message' in php7-fatal-error [puppet] - https://gerrit.wikimedia.org/r/524036 (https://phabricator.wikimedia.org/T228345) (owner: Krinkle)
[01:46:12] (CR) Tim Starling: [C: +2] mediawiki: Fix undefined 'err' and 'message' in php7-fatal-error [puppet] - https://gerrit.wikimedia.org/r/524036 (https://phabricator.wikimedia.org/T228345) (owner: Krinkle)
[01:56:48] Operations, Performance-Team, observability: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (Krinkle)
[02:58:37] PROBLEM - PHP7 rendering on mw2250 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. not found on http://10.192.0.76:9005/w/health-check.php - 380 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:00:09] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[fetch_mediawiki] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[03:05:43] PROBLEM - mediawiki-installation DSH group on mw2250 is CRITICAL: Host mw2250 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
[03:20:29] (PS1) Ayounsi: Anycast: move bird::neighbors_list from role/site to site in codfw [puppet] - https://gerrit.wikimedia.org/r/524076
[03:22:39] PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[03:28:01] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw2250 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:31:36] !log running query for T227843 on mwmaint102
[04:31:38] 1002*
[04:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:45] T227843: Deprecate AbuseFilter's support for Zero - https://phabricator.wikimedia.org/T227843
[05:01:04] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (Marostegui) Open→Resolved All good - thanks @Papaul! ` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I Port Name:...
[05:02:53] Operations, ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (Marostegui) Open→Resolved All good - thanks! ` root@es2003:/usr/local/lib/nagios/plugins# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
[05:05:26] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2045, host will be decommissioned [mediawiki-config] - https://gerrit.wikimedia.org/r/524081 (https://phabricator.wikimedia.org/T228281)
[05:06:22] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2045, host will be decommissioned [mediawiki-config] - https://gerrit.wikimedia.org/r/524081 (https://phabricator.wikimedia.org/T228281) (owner: Marostegui)
[05:07:13] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2045, host will be decommissioned [mediawiki-config] - https://gerrit.wikimedia.org/r/524081 (https://phabricator.wikimedia.org/T228281) (owner: Marostegui)
[05:07:29] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2045, host will be decommissioned [mediawiki-config] - https://gerrit.wikimedia.org/r/524081 (https://phabricator.wikimedia.org/T228281) (owner: Marostegui)
[05:08:38] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2045 from config, will be decommissioned T228281 (duration: 00m 56s)
[05:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:46] T228281: decommission db2045.codfw.wmnet - https://phabricator.wikimedia.org/T228281
[05:09:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2045 from config, will be decommissioned T228281 (duration: 00m 54s)
[05:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:27] (PS1) Marostegui: mariadb: Set db2045 to spare for decommissioning [puppet] - https://gerrit.wikimedia.org/r/524082 (https://phabricator.wikimedia.org/T228281)
[05:12:24] (CR) Marostegui: [C: +2] mariadb: Set db2045 to spare for decommissioning [puppet] - https://gerrit.wikimedia.org/r/524082 (https://phabricator.wikimedia.org/T228281) (owner: Marostegui)
[05:16:12] !log Disable notifications on db2045 T228281
[05:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:22] T228281: decommission db2045.codfw.wmnet - https://phabricator.wikimedia.org/T228281
[05:18:49] !log Remove db2045 from tendril and zarcillo T228281
[05:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:23] !log Stop MySQL on db2045, host will be decommissioned T228281
[05:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:30] T228281: decommission db2045.codfw.wmnet - https://phabricator.wikimedia.org/T228281
[05:25:51] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:42:53] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:47:50] (PS1) Legoktm: Enable SecureLinkFixer in beta cluster (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/524085 (https://phabricator.wikimedia.org/T228374)
[05:47:52] (PS1) Legoktm: Enable SecureLinkFixer in beta cluster (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/524086 (https://phabricator.wikimedia.org/T228374)
[05:58:52] (PS1) MaxSem: Remove apache config for zero.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716)
[06:14:20] (CR) Smalyshev: [C: +1] Revert "Revert "[cirrus] Use correct factory declaration for EntityFullTextQueryBuilder"" [mediawiki-config] - https://gerrit.wikimedia.org/r/522092 (https://phabricator.wikimedia.org/T216429) (owner: DCausse)
[06:15:42] Operations, serviceops, Patch-For-Review, Release-Engineering-Team-TODO (201907), Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (Nikerabbit) Per https://integration.wikimedia.org/ci/job/tr...
[06:16:42] (PS1) Vgutierrez: nc_redirects.dat: Reenable rules for non-canonical wikipedia.org domains [puppet] - https://gerrit.wikimedia.org/r/524092 (https://phabricator.wikimedia.org/T133548)
[06:22:38] (PS1) Vgutierrez: Point several wikipedia non-canonical domains to ncredir-parking [dns] - https://gerrit.wikimedia.org/r/524093 (https://phabricator.wikimedia.org/T133548)
[06:28:13] (CR) Vgutierrez: [C: +2] nc_redirects.dat: Reenable rules for non-canonical wikipedia.org domains [puppet] - https://gerrit.wikimedia.org/r/524092 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[06:28:57] PROBLEM - puppet last run on analytics1073 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:29:11] PROBLEM - puppet last run on rpki2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:31:07] (PS1) Elukey: profile::statistics::gpu: upgrade to ROCm 2.6 [puppet] - https://gerrit.wikimedia.org/r/524095 (https://phabricator.wikimedia.org/T148843)
[06:33:08] (CR) Elukey: [C: +2] profile::statistics::gpu: upgrade to ROCm 2.6 [puppet] - https://gerrit.wikimedia.org/r/524095 (https://phabricator.wikimedia.org/T148843) (owner: Elukey)
[06:33:21] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:34:48] (CR) Vgutierrez: [C: +2] Point several wikipedia non-canonical domains to ncredir-parking [dns] - https://gerrit.wikimedia.org/r/524093 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[06:36:10] (PS1) Elukey: amd_rocm: add support for ROCm 2.6 [puppet] - https://gerrit.wikimedia.org/r/524096
[06:37:25] (CR) Elukey: [C: +2] amd_rocm: add support for ROCm 2.6 [puppet] - https://gerrit.wikimedia.org/r/524096 (owner: Elukey)
[06:41:49] (PS5) Jeena Huneidi: Add mediawiki development chart. [deployment-charts] - https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935)
[06:46:30] (PS5) Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935)
[06:50:32] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/523914 (owner: Jbond)
[06:55:55] PROBLEM - puppet last run on dns2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:55:57] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:56:56] !log deleting zerowiki elastic indices (eqiad and codfw) T227718
[06:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:07] T227718: Delete search indices for now-deleted zerowiki from production, if appropriate - https://phabricator.wikimedia.org/T227718
[06:57:15] RECOVERY - puppet last run on analytics1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:57:29] RECOVERY - puppet last run on rpki2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[07:04:49] (PS1) Elukey: aptrepo: add more packages to the amd-rocm's whitelists [puppet] - https://gerrit.wikimedia.org/r/524160
[07:05:42] (CR) Elukey: [C: +2] aptrepo: add more packages to the amd-rocm's whitelists [puppet] - https://gerrit.wikimedia.org/r/524160 (owner: Elukey)
[07:24:11] RECOVERY - puppet last run on dns2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[07:26:35] (PS1) Muehlenhoff: Decommission old jessie-based ORES pool counters [puppet] - https://gerrit.wikimedia.org/r/524162 (https://phabricator.wikimedia.org/T227640)
[07:29:36] (PS1) DCausse: Revert "[cirrus] switch search traffic (except completion) to codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/524163
[07:34:26] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) Restarted from a clean state as indicated by upstream, and ten...
[07:37:36] (PS2) Marostegui: mariadb: WIP Provision dbproxy2001 into m1 [puppet] - https://gerrit.wikimedia.org/r/518251
[07:51:04] ACKNOWLEDGEMENT - DPKG on contint1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages amusso Zuul / python oddity https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[08:00:29] ^^^ I will fix it later this morning
[08:00:30] :)
[08:03:20] (PS1) Elukey: profile::hive::client: add --verbose=true to default parameters [puppet] - https://gerrit.wikimedia.org/r/524168 (https://phabricator.wikimedia.org/T136858)
[08:03:29] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[08:03:51] Operations, serviceops, Patch-For-Review, Release-Engineering-Team-TODO (201907), Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (akosiaris) p:Normal→High >>! In T228196#5342893, @t...
[08:05:07] (PS3) Marostegui: mariadb: Provision dbproxy2001 into codfw m1 [puppet] - https://gerrit.wikimedia.org/r/518251 (https://phabricator.wikimedia.org/T202367)
[08:06:08] (CR) jerkins-bot: [V: -1] mariadb: Provision dbproxy2001 into codfw m1 [puppet] - https://gerrit.wikimedia.org/r/518251 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[08:06:45] (CR) Elukey: [C: +2] profile::hive::client: add --verbose=true to default parameters [puppet] - https://gerrit.wikimedia.org/r/524168 (https://phabricator.wikimedia.org/T136858) (owner: Elukey)
[08:07:58] (PS4) Marostegui: mariadb: Provision dbproxy2001 into codfw m1 [puppet] - https://gerrit.wikimedia.org/r/518251 (https://phabricator.wikimedia.org/T202367)
[08:11:23] (PS2) Gehel: wdqs: introduced tuned journal options to wdqs2006. [puppet] - https://gerrit.wikimedia.org/r/523869 (https://phabricator.wikimedia.org/T228122)
[08:13:06] (CR) Gehel: [C: +2] wdqs: introduced tuned journal options to wdqs2006. [puppet] - https://gerrit.wikimedia.org/r/523869 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:14:57] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[08:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:10] (PS5) Marostegui: mariadb: Provision dbproxy2001 into codfw m1 [puppet] - https://gerrit.wikimedia.org/r/518251 (https://phabricator.wikimedia.org/T202367)
[08:17:30] (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1001/17452/" [puppet] - https://gerrit.wikimedia.org/r/518251 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[08:17:44] (PS6) Marostegui: mariadb: Provision dbproxy2001 into codfw m1 [puppet] - https://gerrit.wikimedia.org/r/518251 (https://phabricator.wikimedia.org/T202367)
[08:18:39] !log gehel@cumin1001 START - Cookbook sre.postgresql.postgres-init
[08:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:17] (CR) Alexandros Kosiaris: [C: +1] Decommission old jessie-based ORES pool counters [puppet] - https://gerrit.wikimedia.org/r/524162 (https://phabricator.wikimedia.org/T227640) (owner: Muehlenhoff)
[08:29:54] Operations, MediaWiki-Uploading, Multimedia, media-storage, Wikimedia-production-error: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (fgiunchedi) p:Unbreak!→Normal The errors from Upload...
[08:31:26] (CR) Arturo Borrero Gonzalez: [C: +1] standard: remove has_admin global variable [puppet] - https://gerrit.wikimedia.org/r/523914 (owner: Jbond)
[08:31:45] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[08:31:50] (CR) Alexandros Kosiaris: [C: -1] "Thanks for this, but I was referring specifically to the tests/ directory, not the Rspec tests. I think the Rspec ones are still useful" [puppet] - https://gerrit.wikimedia.org/r/524043 (owner: Dzahn)
[08:33:57] (CR) Alexandros Kosiaris: [C: +2] "Yup, duplicate contact groups are not hurting icinga (we indeed have them all over the place)." [puppet] - https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: Alexandros Kosiaris)
[08:34:04] (PS2) Alexandros Kosiaris: Don't page on mgmt failures [puppet] - https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458)
[08:39:48] Operations, Goal: TEC6: Database Automation - https://phabricator.wikimedia.org/T220395 (Volans) p:Triage→Normal
[08:40:10] akosiaris: nice! re: don't page on mgmt
[08:40:27] (PS3) Filippo Giunchedi: wmnet: flip syslog.eqiad.wmnet to centrallog1001 [dns] - https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706)
[08:40:42] godog: yw
[08:41:53] RECOVERY - DPKG on contint1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[08:43:34] (CR) Alexandros Kosiaris: [V: +2 C: +1] sessionstore staging - update to v1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/524024 (owner: Eevans)
[08:44:01] (CR) Alexandros Kosiaris: [V: +2 C: +1] "> Also: These deployments are meant to be self-service aren't they? Should I have +2 on this repository?" [deployment-charts] - https://gerrit.wikimedia.org/r/524024 (owner: Eevans)
[08:44:48] (CR) Filippo Giunchedi: [C: +2] wmnet: flip syslog.eqiad.wmnet to centrallog1001 [dns] - https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706) (owner: Filippo Giunchedi)
[08:47:46] (PS1) Muehlenhoff: Extend Hiera LDAP config [puppet] - https://gerrit.wikimedia.org/r/524172
[08:48:25] (PS7) Jcrespo: Remove sarin/neodymium from grant/mysql root hosts [puppet] - https://gerrit.wikimedia.org/r/466833 (owner: Muehlenhoff)
[08:51:00] (PS1) Marostegui: db-codfw.php: Depool db2116 [mediawiki-config] - https://gerrit.wikimedia.org/r/524173 (https://phabricator.wikimedia.org/T226851)
[08:52:01] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2116 [mediawiki-config] - https://gerrit.wikimedia.org/r/524173 (https://phabricator.wikimedia.org/T226851) (owner: Marostegui)
[08:53:00] (Merged) jenkins-bot: db-codfw.php: Depool db2116 [mediawiki-config] - https://gerrit.wikimedia.org/r/524173 (https://phabricator.wikimedia.org/T226851) (owner: Marostegui)
[08:53:16] (CR) jenkins-bot: db-codfw.php: Depool db2116 [mediawiki-config] - https://gerrit.wikimedia.org/r/524173 (https://phabricator.wikimedia.org/T226851) (owner: Marostegui)
[08:54:08] Operations, Traffic: Provide prometheus metrics for the ncredir service - https://phabricator.wikimedia.org/T228382 (Vgutierrez)
[08:54:18] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2116 (duration: 00m 55s)
[08:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:34] (PS1) Hashar: zuul: fix systemd Service/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381)
[08:56:04] !log Drop afl_log_id column from enwiki.abuse_filter_log on db2116 T226851
[08:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:12] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851
[08:57:13] !log contint1001: stopped zuul, ran apt install to get the new python2.7 copied to Zuul virtualenv, restarted zuul/zuul-merger. That clears a couple Icinga alarms from yesterday
[08:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:22] (CR) Filippo Giunchedi: Add an anycast endpoint to syslog centralservers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524037 (owner: Ayounsi)
[08:57:41] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[09:03:16] anomie: happy to try again today to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/493323
[09:03:56] !log reuploading missing layers T228196
[09:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:05] T228196: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196
[09:06:00] Operations, serviceops, Patch-For-Review, Release-Engineering-Team-TODO (201907), Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (fsero) i've uploaded the missing layers from a backup, it w...
[09:07:36] Operations, serviceops, Patch-For-Review, Release-Engineering-Team-TODO (201907), Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (fsero) fixes also docker-registry.wikimedia.org/releng/comp...
[09:09:08] !log resume swift ms-be rolling restarts - T225713
[09:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:15] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713
[09:10:07] (CR) Marostegui: [C: +1] Remove sarin/neodymium from grant/mysql root hosts [puppet] - https://gerrit.wikimedia.org/r/466833 (owner: Muehlenhoff)
[09:11:15] Operations, MediaWiki-Uploading, Multimedia, media-storage, Wikimedia-production-error: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (zeljkofilipin) @fgiunchedi If this is not blocking the train...
[09:12:31] (CR) Elukey: [C: +1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17454/" [puppet] - https://gerrit.wikimedia.org/r/524172 (owner: Muehlenhoff)
[09:22:17] Operations, observability, Patch-For-Review, Performance-Team (Radar), User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (elukey) Open→Resolved
[09:22:24] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey)
[09:24:49] (PS1) Arturo Borrero Gonzalez: toolforge: k8s: introduce calico workaround for the iptables backend in buster [puppet] - https://gerrit.wikimedia.org/r/524175 (https://phabricator.wikimedia.org/T228267)
[09:26:24] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[09:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:30] (PS1) Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382)
[09:27:41] PROBLEM - High lag on wdqs2006 is CRITICAL: 4373 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:27:57] (CR) jerkins-bot: [V: -1] fifo_log_demux: Provide pipe creation capabilities [puppet] - https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382) (owner: Vgutierrez)
[09:28:26] ACKNOWLEDGEMENT - High lag on wdqs2006 is CRITICAL: 4373 ge 3600 Gehel catching up after data reload - https://phabricator.wikimedia.org/T228122 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:29:28] (PS2) Gehel: wdqs: introduced tuned journal options to wdqs1007. [puppet] - https://gerrit.wikimedia.org/r/523870 (https://phabricator.wikimedia.org/T228122)
[09:29:40] (PS1) Ema: tlsproxy::instance: allow overriding ssl compatibility mode [puppet] - https://gerrit.wikimedia.org/r/524177 (https://phabricator.wikimedia.org/T227860)
[09:30:36] nice --^
[09:31:09] (CR) Gehel: [C: +2] wdqs: introduced tuned journal options to wdqs1007. [puppet] - https://gerrit.wikimedia.org/r/523870 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[09:31:24] (PS2) Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382)
[09:33:05] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[09:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:19] (PS6) Filippo Giunchedi: logstash: add consumer for client errors [puppet] - https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142)
[09:34:32] Operations, Operations-Software-Development, netops, Goal: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (Volans) p:Triage→Normal
[09:36:33] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/17455/" [puppet] - https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: Filippo Giunchedi)
[09:36:38] (CR) Filippo Giunchedi: [C: +2] logstash: add consumer for client errors [puppet] - https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: Filippo Giunchedi)
[09:37:25] (PS8) Jcrespo: Remove sarin/neodymium from grant/mysql root hosts [puppet] - https://gerrit.wikimedia.org/r/466833 (owner: Muehlenhoff)
[09:39:23] (CR) Ema: [C: +1] "LGTM, one nit!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382) (owner: Vgutierrez)
[09:39:34] (CR) Muehlenhoff: "One thing we could do is to also store the base_dn and the users and groups container under common.yaml, the attribute used for authentica" [puppet] - https://gerrit.wikimedia.org/r/524172 (owner: Muehlenhoff)
[09:40:46] (CR) Ema: "pcc seems sane https://puppet-compiler.wmflabs.org/compiler1002/17456/" [puppet] - https://gerrit.wikimedia.org/r/524177 (https://phabricator.wikimedia.org/T227860) (owner: Ema)
[09:42:46] (CR) Elukey: [C: +1] "> One thing we could do is to also store the base_dn and the users" [puppet] - https://gerrit.wikimedia.org/r/524172 (owner: Muehlenhoff)
[09:46:08] Operations, Operations-Software-Development, netops, Goal: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (Volans) > [stretch] Evaluate Netbox to store network secrets After playing a bit with secrets in our Netbox test box I've come to the conclusio...
[09:47:44] (CR) Effie Mouzeli: [V: +1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17457/" [puppet] - https://gerrit.wikimedia.org/r/523955 (https://phabricator.wikimedia.org/T228063) (owner: Effie Mouzeli)
[09:54:16] (PS4) Jbond: varnishmtail: use -logs /dev/stdin instead of -logfds 0 [puppet] - https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604)
[09:54:49] (CR) Jbond: [C: +2] varnishmtail: use -logs /dev/stdin instead of -logfds 0 [puppet] - https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604) (owner: Jbond)
[09:56:20] (PS2) Muehlenhoff: Extend Hiera LDAP config [puppet] - https://gerrit.wikimedia.org/r/524172
[09:57:13] (CR) jerkins-bot: [V: -1] Extend Hiera LDAP config [puppet] - https://gerrit.wikimedia.org/r/524172 (owner: Muehlenhoff)
[09:58:01] (PS3) Vgutierrez: fifo_log_demux: Provide pipe creation capabilities [puppet] - https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382)
[09:59:03] (CR) Vgutierrez: fifo_log_demux: Provide pipe creation capabilities (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524176 (https://phabricator.wikimedia.org/T228382) (owner: Vgutierrez)
[10:00:26] (PS3) Muehlenhoff: Extend Hiera LDAP config [puppet] - https://gerrit.wikimedia.org/r/524172
[10:01:18] (CR) jerkins-bot: [V: -1] Extend Hiera LDAP config [puppet] - https://gerrit.wikimedia.org/r/524172 (owner: Muehlenhoff)
[10:02:17] RECOVERY - High lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 893.9 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:06:14] Operations, Puppet: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (fgiunchedi)
[10:10:28] (PS1) Hashar: zuul: stop zuul-merger gracefully [puppet] - https://gerrit.wikimedia.org/r/524180
[10:12:31] (PS4) Muehlenhoff: Extend Hiera LDAP config [puppet] - https://gerrit.wikimedia.org/r/524172
[10:15:12] !log Disable puppet on services_proxy hosts - T228063
[10:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:20] T228063: Intermittent connect timeout for CirrusSearch connections - https://phabricator.wikimedia.org/T228063
[10:17:28] (CR) Effie Mouzeli: [V: +1 C: +2] hieradata: Set connect_timeout for cirrussearch [puppet] - https://gerrit.wikimedia.org/r/523955 (https://phabricator.wikimedia.org/T228063) (owner: Effie Mouzeli)
[10:17:41] (PS2) Effie Mouzeli: hieradata: Set connect_timeout for cirrussearch [puppet] - https://gerrit.wikimedia.org/r/523955 (https://phabricator.wikimedia.org/T228063)
[10:17:49] (PS1) Filippo Giunchedi: puppetmaster: blacklist per-host catalogs metrics [puppet] - https://gerrit.wikimedia.org/r/524182 (https://phabricator.wikimedia.org/T228395)
[10:20:00] (PS2) Jbond: gemfile: bump safe_yaml to 1.0.5 [puppet] - https://gerrit.wikimedia.org/r/523988 (owner: Cwhite)
[10:21:55] (CR) Jbond: "LGTM - i also fixed the few unrelated rubocop offences" [puppet] - https://gerrit.wikimedia.org/r/523988 (owner: Cwhite)
[10:26:16] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi)
[10:26:55] PROBLEM - Host ms-be2038 is DOWN: PING CRITICAL - Packet loss = 100%
[10:27:39] !log gehel@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[10:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:28] ugh ms-be2038 should have been downtimed, anyways expected
[10:28:51] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17459/" [puppet] - https://gerrit.wikimedia.org/r/524172 (owner: Muehlenhoff)
[10:29:01] RECOVERY - Host ms-be2038 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[10:29:08] Operations, Puppet, Patch-For-Review: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (jbond) below are all the hosts with puppetdb package installed ` af-puppetdb[01-02].automation-framework.eqiad.wmflabs,compiler[1001-1002].puppet-diffs.eqiad.wmflabs jeh-p...
[10:29:12] !log reboot wezen.codfw.wmnet - T225713
[10:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:20] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713
[10:30:02] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (elukey)
[10:30:23] PROBLEM - Host wezen is DOWN: PING CRITICAL - Packet loss = 100%
[10:31:11] RECOVERY - Host wezen is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms
[10:32:15] there should be rsyslog delivery failures alerts coming in too
[10:33:16] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi)
[10:34:43] (CR) Jbond: [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/523988 (owner: Cwhite)
[10:35:17] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[10:36:49] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[10:36:54] (CR) Jbond: [C: +2] standard: remove has_admin global variable [puppet] - https://gerrit.wikimedia.org/r/523914 (owner: Jbond)
[10:37:02] (PS2) Jbond: standard: remove has_admin global variable [puppet] - https://gerrit.wikimedia.org/r/523914
[10:37:49] !log enable puppet on services_proxy hosts - T228063
[10:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:58] T228063: Intermittent connect timeout for CirrusSearch connections - https://phabricator.wikimedia.org/T228063
[10:41:24] Operations, vm-requests: Site: (QUANTITY) VM %request for SERVICE[S] - https://phabricator.wikimedia.org/T228403 (MoritzMuehlenhoff)
[10:41:35] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[fetch_mediawiki] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[10:41:48] Operations, vm-requests: Site: (QUANTITY) VM %request for SERVICE[S] - https://phabricator.wikimedia.org/T228403 (MoritzMuehlenhoff) p:Triage→Normal a:MoritzMuehlenhoff
[10:42:07] (PS1) Elukey: Add TLS .crt file for Analytics UIs backend services [puppet] - https://gerrit.wikimedia.org/r/524184 (https://phabricator.wikimedia.org/T227860)
[10:42:27] (CR) Jbond: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/524172 (owner: Muehlenhoff)
[10:42:32] (CR) Jbond: [C: +1] Extend Hiera LDAP config [puppet] - https://gerrit.wikimedia.org/r/524172 (owner: Muehlenhoff)
[10:43:07] Operations, vm-requests: eqiad: One VM request for identity provider - https://phabricator.wikimedia.org/T228403 (MoritzMuehlenhoff)
[10:43:21] !log cp-eqiad: varnish frontend rolling restarts for 5.1.3-1wm11 upgrades T227672
[10:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:28] T227672: Upgrade Varnish to 5.1.3-1wm11 - https://phabricator.wikimedia.org/T227672
[10:44:17] moritzm: so lookup handles hash merge transparently?
[10:44:45] ema: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/524184 :)
[10:44:48] elukey: the third parameter specifies the merge type
[10:44:54] ah!
[10:44:59] (PS1) Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382)
[10:45:03] (CR) Filippo Giunchedi: "What happens to traffic when we're turning up a new syslog server in the same site ? Case in point like now we're replacing lithium with c" [puppet] - https://gerrit.wikimedia.org/r/524037 (owner: Ayounsi)
[10:45:09] (PS2) Elukey: Add TLS .crt file for Analytics UIs backend services [puppet] - https://gerrit.wikimedia.org/r/524184 (https://phabricator.wikimedia.org/T227860)
[10:45:31] elukey: it can also be added to hiera https://puppet.com/docs/puppet/5.0/hiera_merging.html#lookupoptions-format.
[10:45:43] (CR) jerkins-bot: [V: -1] ncredir: Use pipes instead of files for the access_log [puppet] - https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: Vgutierrez)
[10:46:10] Operations, Wikimedia-Logstash, User-fgiunchedi, User-herron: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (fgiunchedi)
[10:46:19] Operations, netops, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (fgiunchedi) Open→Invalid Indeed the solution is to change the syslog cname.
[10:46:20] elukey: yeah, see https://puppet.com/docs/puppet/5.2/hiera_use_function.html (Merge Behaviors) [10:46:21] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10fgiunchedi) [10:46:36] (03CR) 10Ema: [C: 03+1] "Looks pretty!" [puppet] - 10https://gerrit.wikimedia.org/r/524184 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [10:47:04] (03PS2) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [10:47:22] (03CR) 10Elukey: restbase: add TLS support via tlsproxy::localssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:47:30] (03PS1) 10Alexandros Kosiaris: kubernetes: Switch to using systemd cgroupdriver [puppet] - 10https://gerrit.wikimedia.org/r/524186 [10:47:37] (03CR) 10Elukey: [C: 03+2] Add TLS .crt file for Analytics UIs backend services [puppet] - 10https://gerrit.wikimedia.org/r/524184 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [10:48:21] (03PS3) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [10:51:06] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: introduce calico workaround for the iptables backend in buster [puppet] - 10https://gerrit.wikimedia.org/r/524175 (https://phabricator.wikimedia.org/T228267) [10:51:53] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [10:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:18] PROBLEM - High lag on wdqs1010 is CRITICAL: 4756 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag 
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:52:58] PROBLEM - High lag on wdqs1007 is CRITICAL: 4796 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:53:21] (03PS4) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [10:53:25] Lag above can be ignored, will silence them when back from lunch [10:54:29] (03PS2) 10DCausse: Revert "[cirrus] switch search traffic (except completion) to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524163 [10:54:44] (03CR) 10Elukey: [C: 03+1] tlsproxy::instance: allow overriding ssl compatibility mode [puppet] - 10https://gerrit.wikimedia.org/r/524177 (https://phabricator.wikimedia.org/T227860) (owner: 10Ema) [10:56:03] (03CR) 10Elukey: [C: 03+1] tlsproxy::instance: allow overriding ssl compatibility mode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524177 (https://phabricator.wikimedia.org/T227860) (owner: 10Ema) [10:59:21] (03PS5) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [11:00:04] Amir1, Lucas_WMDE, and awight: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T1100). [11:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:28] o/ [11:01:04] I guess I can swat [11:01:20] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524163 (owner: 10DCausse) [11:01:46] * awight applies oil can to rusty joints [11:02:11] !log swift eqiad-prod: put back ms-be1043 sdk1 - T218544 [11:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:19] T218544: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 [11:02:31] (03Merged) 10jenkins-bot: Revert "[cirrus] switch search traffic (except completion) to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524163 (owner: 10DCausse) [11:02:47] (03CR) 10jenkins-bot: Revert "[cirrus] switch search traffic (except completion) to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524163 (owner: 10DCausse) [11:04:34] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10fgiunchedi) Drive by comment that occurred to me yesterday while looking into {T227668}, during the transition period we'll have to adjust dashboards to account fo... [11:05:30] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: introduce calico workaround for the iptables backend in buster [puppet] - 10https://gerrit.wikimedia.org/r/524175 (https://phabricator.wikimedia.org/T228267) [11:08:14] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+1] "I 've just added mediawiki-services to https://gerrit.wikimedia.org/r/#/admin/projects/operations/deployment-charts,access. 
You should be " [deployment-charts] - 10https://gerrit.wikimedia.org/r/524024 (owner: 10Eevans) [11:08:31] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert [cirrus] switch search traffic (except completion) to codfw (duration: 00m 56s) [11:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: introduce calico workaround for the iptables backend in buster [puppet] - 10https://gerrit.wikimedia.org/r/524175 (https://phabricator.wikimedia.org/T228267) (owner: 10Arturo Borrero Gonzalez) [11:09:42] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10fgiunchedi) >>! In T184086#5340821, @hashar wrote: > **`metrics-reporter-jmx`** > > The code is... [11:09:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] Extend Hiera LDAP config [puppet] - 10https://gerrit.wikimedia.org/r/524172 (owner: 10Muehlenhoff) [11:13:21] !log EU Swat done [11:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:09] 10Operations, 10MediaWiki-Uploading, 10Multimedia, 10media-storage, 10Wikimedia-production-error: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (10fgiunchedi) [11:14:40] (03CR) 10Elukey: [C: 03+1] "Looks great, I'll follow up and use the new parameters in profile::hue" [puppet] - 10https://gerrit.wikimedia.org/r/524172 (owner: 10Muehlenhoff) [11:15:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] Blubberoid: enable policy, bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/522561 (owner: 10Thcipriani) [11:15:09] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+1] Blubberoid: enable policy, bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/522561 (owner: 
10Thcipriani) [11:18:34] 10Operations, 10Wikimedia-production-error (Shared Build Failure): Everything fails with unable to load the docker file - https://phabricator.wikimedia.org/T227833 (10akosiaris) Is this still ongoing? Do I understand correctly that the change to Resolved status was erroneous? Should we reopen? Related but no... [11:18:43] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) One interesting thing is that the avg latency provided by mcrouter seems... [11:23:22] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 941 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:23:41] (03PS1) 10Ema: tlsproxy::instance: use lookup() instead of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/524190 [11:24:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, untested and hard to fully test with PCC ATM. 
I'd recommend also setting monitoring_enabled for cp1088 as the canary host" [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [11:24:43] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::instance: use lookup() instead of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/524190 (owner: 10Ema) [11:24:51] (03CR) 10Ema: tlsproxy::instance: allow overriding ssl compatibility mode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524177 (https://phabricator.wikimedia.org/T227860) (owner: 10Ema) [11:25:22] (03PS2) 10Ema: tlsproxy::instance: allow overriding ssl compatibility mode [puppet] - 10https://gerrit.wikimedia.org/r/524177 (https://phabricator.wikimedia.org/T227860) [11:27:22] (03CR) 10Ema: [C: 03+2] tlsproxy::instance: allow overriding ssl compatibility mode [puppet] - 10https://gerrit.wikimedia.org/r/524177 (https://phabricator.wikimedia.org/T227860) (owner: 10Ema) [11:27:27] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10dr0ptp4kt) Hi team, just to follow up on what I've let some of you know by email, I'm going to investigat... 
[11:28:12] (03PS5) 10Muehlenhoff: Extend Hiera LDAP config [puppet] - 10https://gerrit.wikimedia.org/r/524172 [11:30:23] (03CR) 10Muehlenhoff: [C: 03+2] Extend Hiera LDAP config [puppet] - 10https://gerrit.wikimedia.org/r/524172 (owner: 10Muehlenhoff) [11:30:48] RECOVERY - High lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 1092 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:32:03] (03PS6) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [11:33:01] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/523709 (https://phabricator.wikimedia.org/T228266) (owner: 10Jbond) [11:34:24] (03CR) 10Volans: [C: 03+1] "I didn't check the copy/paste from the other repo, but the schema.yaml LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/523943 (https://phabricator.wikimedia.org/T197126) (owner: 10CDanis) [11:37:19] (03PS7) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [11:38:14] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [11:38:50] (03PS8) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [11:42:46] (03CR) 10Jbond: [C: 03+2] puppet_compiler: Add checks for missing facts files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/523709 (https://phabricator.wikimedia.org/T228266) (owner: 10Jbond) [11:44:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of comments inline." 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [11:56:59] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: calico_workaround: convert from profile to module [puppet] - 10https://gerrit.wikimedia.org/r/524194 [11:57:01] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: kubeadm: reorganize and cleanup module code [puppet] - 10https://gerrit.wikimedia.org/r/524195 [11:58:53] (03CR) 10Mvolz: [C: 03+1] "I've added a few properties on beta, could use more of course: https://wikidata.beta.wmflabs.org/wiki/MediaWiki:Citoid-wikibase-config.jso" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (owner: 10Lucas Werkmeister (WMDE)) [11:59:03] (03PS2) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [12:00:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: calico_workaround: convert from profile to module [puppet] - 10https://gerrit.wikimedia.org/r/524194 (owner: 10Arturo Borrero Gonzalez) [12:00:30] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10jbond) [12:00:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: kubeadm: reorganize and cleanup module code [puppet] - 10https://gerrit.wikimedia.org/r/524195 (owner: 10Arturo Borrero Gonzalez) [12:03:31] (03PS1) 10Jbond: puppetmasters: add new puppetmaster1003 to puppetmasters config [puppet] - 10https://gerrit.wikimedia.org/r/524197 (https://phabricator.wikimedia.org/T201342) [12:07:30] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:16:08] (03PS2) 10Lucas Werkmeister (WMDE): Define settings for Citoid+Wikibase integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523139 (https://phabricator.wikimedia.org/T228414) [12:16:10] (03PS2) 10Lucas Werkmeister (WMDE): Set $wgWBRepoSettings['enableRefTabs'] in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523140 (https://phabricator.wikimedia.org/T228414) [12:16:12] (03PS2) 10Lucas Werkmeister (WMDE): Configure Citoid+Wikibase integration on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (https://phabricator.wikimedia.org/T228411) [12:16:37] 10Operations, 10vm-requests: eqiad: One VM request for identity provider - https://phabricator.wikimedia.org/T228403 (10akosiaris) LGTM [12:17:42] 10Operations, 10Performance-Team, 10TechCom-RFC, 10Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10BBlack) I like the end result here, and I don't think it's problematic from the #Traffic perspective in the long view, but I think the... [12:21:32] (03PS1) 10Muehlenhoff: Switch profile::openldap::management to obtain the LDAP server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/524201 (https://phabricator.wikimedia.org/T46722) [12:24:23] 10Operations, 10Performance-Team, 10TechCom-RFC, 10Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10BBlack) Oh one more thing that should've been (3) on that list: I'm pretty sure UAs cache 301s "Permanently" as indicated, so there's... [12:26:45] 10Operations, 10User-Elukey: memkeys segfaults on Debian Stretch - https://phabricator.wikimedia.org/T223863 (10elukey) Created a pull request https://github.com/bmatheny/memkeys/pull/26/commits From my tests it fixes the issue, but I am not sure how reactive upstream is in these days. 
Since there is currentl... [12:26:53] 10Operations, 10User-Elukey: memkeys segfaults on Debian Stretch - https://phabricator.wikimedia.org/T223863 (10elukey) p:05Triage→03Normal [12:26:59] !log add mtail 3.0.0~rc24.1-1+wmf1 to stretch-wikimedia [12:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:36] RECOVERY - Check systemd state on maps1002 is OK: OK - running: The system is fully operational [12:31:31] (03CR) 10Lucas Werkmeister (WMDE): "Great, I’ll try to get this deployed tonight (updated the changes to add task IDs too)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (https://phabricator.wikimedia.org/T228411) (owner: 10Lucas Werkmeister (WMDE)) [12:33:50] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17471/" [puppet] - 10https://gerrit.wikimedia.org/r/524201 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [12:34:06] !log gehel@cumin1001 START - Cookbook sre.postgresql.postgres-init [12:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:02] (03PS2) 10Gehel: wdqs: introduced tuned journal options to wdqs1008. 
[puppet] - 10https://gerrit.wikimedia.org/r/523871 (https://phabricator.wikimedia.org/T228122) [12:44:47] (03PS2) 10Ema: restbase: add TLS support via profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) [12:44:49] (03PS1) 10Ema: Add profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/524205 (https://phabricator.wikimedia.org/T210411) [12:44:57] 10Operations, 10netops, 10User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (10fgiunchedi) 05Invalid→03Open Reopening as I'm seeing syslog traffic from `scs-c1-eqiad.mgmt.eqiad.wmnet` towards lithium even after dns flip [12:44:59] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10fgiunchedi) [12:45:28] (03CR) 10Gehel: [C: 03+2] wdqs: introduced tuned journal options to wdqs1008. [puppet] - 10https://gerrit.wikimedia.org/r/523871 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel) [12:45:34] (03CR) 10Arturo Borrero Gonzalez: toolforge: Enable pod security policy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [12:45:43] (03CR) 10jerkins-bot: [V: 04-1] Add profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/524205 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [12:48:03] (03PS2) 10Ema: Add profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/524205 (https://phabricator.wikimedia.org/T210411) [12:48:05] (03PS3) 10Ema: restbase: add TLS support via profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) [12:51:34] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [12:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:36] 
(03CR) 10Arturo Borrero Gonzalez: toolforge: Enable pod security policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [12:53:41] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [12:53:43] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [12:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:27] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10fgiunchedi) Not sure if relevant or not, but cluster wmcs also shows elevated retransmits around the same period: {F29800469} [12:54:34] (03PS2) 10Gehel: wdqs: introduced tuned journal options to wdqs2002. [puppet] - 10https://gerrit.wikimedia.org/r/523872 (https://phabricator.wikimedia.org/T228122) [12:55:25] (03CR) 10Gehel: [C: 03+2] wdqs: introduced tuned journal options to wdqs2002. [puppet] - 10https://gerrit.wikimedia.org/r/523872 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel) [12:57:12] (03CR) 10Bstorm: toolforge: Enable pod security policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [12:57:46] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer [12:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:13] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/524182 (https://phabricator.wikimedia.org/T228395) (owner: 10Filippo Giunchedi) [13:00:04] liw: Time to snap out of that daydream and deploy MediaWiki train - European version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T1300). 
[13:00:39] !log rolling upgrade of mtail [13:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:44] train status: there's two blockers, waiting to see if there's any responses to them or the train-is-blocked email [13:00:49] (03CR) 10Elukey: [C: 03+1] Add profile::tlsproxy::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524205 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [13:00:50] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:01:10] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:02:32] gehel: o/ - is this new --^ ? [13:02:43] liw: See my comments about train blocker T228417 [13:02:44] T228417: RevisionFormatter.php: Unknown content format - https://phabricator.wikimedia.org/T228417 [13:02:49] elukey: unexpected, checking [13:02:59] ah ack, lemme know if you need help [13:03:14] (03PS4) 10Ema: restbase: add TLS support via profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) [13:03:25] stephanebisson, thanks, looking [13:03:43] elukey: probably just not enough sleep between depool and shutting down blazegraph for data copy [13:03:49] liw: I'm looking at the other Flow-related blocker now [13:03:52] stephanebisson, thank you, I'll downgrade and remove as blocker [13:04:22] elukey: thanks for the ping! [13:05:35] np! 
[13:05:42] stephanebisson, and thank you for quickly responding to both tickets - I'm hoping the other one is also not a real blocker (but I don't have the knowhow to decide) [13:05:51] yep, just not enough sleep, I ran depool again, but looking at my history, I did do it already [13:06:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/524197 (https://phabricator.wikimedia.org/T201342) (owner: 10Jbond) [13:07:20] PROBLEM - DPKG on cp5006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:09:08] (03PS3) 10Ema: Add profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/524205 (https://phabricator.wikimedia.org/T210411) [13:09:51] jbond42: looks like varnishmtail is broken on cp3030 and others [13:10:04] ema: looking [13:10:25] (03CR) 10Bstorm: "The goal here is simply to get PSP running on a cluster so that experimentation and design work is easier. This gives a rather blanket ca" [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [13:10:52] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 41.18 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:11:08] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 41.43 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:11:11] godog: I have the next hour free if you're ready for that patch. 
[13:11:27] the above alerts might be due to mtail ^ [13:11:30] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 34.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:11:30] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 32.49 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:11:52] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 38.25 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:12:12] PROBLEM - Check systemd state on cp3042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:12:32] PROBLEM - Check systemd state on cp3033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:12:38] PROBLEM - Check systemd state on cp3040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:12:47] jbond42: we might want to rollback to /var/cache/apt/archives/mtail_3.0.0~rc5-1~bpo9+1wmf1_amd64.deb [13:13:17] jbond42: can I proceed? 
[13:13:33] ema: i think varnishmtail is now running everywhere just about to check varnishmtail-backend [13:13:44] i think the restarts just had an issue [13:14:08] jbond42: varnishmtail isn't running on cp3030, for example [13:14:28] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:14:48] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:14:48] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:15:10] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:15:32] RECOVERY - Check systemd state on cp3042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:15:33] anomie: ack, I'll merge shortly [13:15:41] ok now I see that it's recovering in esams too [13:15:49] anything I can help with mtail ema / jbond42 ? [13:15:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:15:56] RECOVERY - Check systemd state on cp3040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:16:09] jbond42: no, actually it's not starting properly [13:16:12] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) 05Resolved→03Open Thanks, I can confirm the component is around and it addresses the concern of mixing up up... [13:16:18] 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [13:16:19] liw: I've responded to T228406 [13:16:19] T228406: TopicListBlock.php sort order params - https://phabricator.wikimedia.org/T228406 [13:16:21] oh is it starting then dieing [13:16:24] PROBLEM - Check systemd state on cp3030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:16:27] jbond42: let me rollback [13:16:43] do you want me to do it via debdeploy [13:17:27] jbond42: sure, I've just tested the downgrade on cp3030 and varnishmtail started fine there [13:17:39] dpkg -i /var/cache/apt/archives/mtail_3.0.0~rc5-1~bpo9+1wmf1_amd64.deb [13:17:39] RECOVERY - Check systemd state on cp3030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:17:42] ok pushing the downgrade now [13:17:46] that's what I've done ^ [13:18:33] PROBLEM - Check systemd state on cp3032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:18:35] PROBLEM - Check systemd state on cp3043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:19:08] stephanebisson, thanks! [13:19:42] ema: do you knwo what switch to use to get it to 'eep your currently-installed version [13:19:49] 'keep your currently-installed version [13:20:01] moritzm: ^^ [13:20:27] PROBLEM - DPKG on cp1083 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:27] PROBLEM - DPKG on cp4028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:27] PROBLEM - DPKG on cp1089 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:29] PROBLEM - DPKG on cp3039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:29] PROBLEM - DPKG on cp3035 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:35] PROBLEM - DPKG on cp2018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:41] PROBLEM - DPKG on cp2025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:45] PROBLEM - DPKG on cp3043 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:47] PROBLEM - Check systemd state on cp1085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:51] PROBLEM - DPKG on cp4029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:55] PROBLEM - DPKG on cp2020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:55] PROBLEM - DPKG on cp5008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:57] PROBLEM - DPKG on cp1081 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:57] PROBLEM - DPKG on cp1086 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:59] PROBLEM - DPKG on cp3036 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:20:59] PROBLEM - DPKG on cp2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:01] PROBLEM - DPKG on cp1088 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:01] PROBLEM - DPKG on cp4032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:01] PROBLEM - DPKG on cp2012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:01] PROBLEM - DPKG on cp1080 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:03] PROBLEM - DPKG on cp2016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:05] PROBLEM - DPKG on cp1087 
is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:05] PROBLEM - DPKG on cp2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:09] PROBLEM - Check systemd state on cp3042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:09] PROBLEM - Check systemd state on cp3041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:10] woot? [13:21:11] PROBLEM - DPKG on cp5005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:11] * jbond42 oh my :( [13:21:13] PROBLEM - DPKG on cp4027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:13] PROBLEM - DPKG on cp4021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:13] PROBLEM - DPKG on cp4025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:13] PROBLEM - DPKG on cp4024 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:15] PROBLEM - DPKG on cp3045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:17] PROBLEM - DPKG on cp3044 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:17] PROBLEM - DPKG on cp3041 is CRITICAL: DPKG CRITICAL dpkg reports broken packages 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:21] I'll stop ircecho [13:21:23] PROBLEM - DPKG on cp2007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:23] PROBLEM - DPKG on cp2023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:21:45] RECOVERY - DPKG on cp2025 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:22:00] !log temporarily stop ircecho on icinga1001 to avoid spam [13:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:06] jbond42: no, I'm downgrading manually [13:22:09] thanks godog [13:22:20] ema: i think i have successfully downgraded everything now [13:22:54] I'm looking at this to see when we're back btw https://grafana.wikimedia.org/d/000000479/frontend-traffic [13:23:11] there's some lag though, so might take 3-4 min [13:23:11] !log promoting 1.34.0-wmf.14 to group1 [13:23:15] jbond42: ok [13:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:36] ok im showing everything as started again [13:24:26] jbond42: please !log the downgrade as well [13:24:42] I'll bring back ircecho [13:24:49] !log downgrade cp servers back to 3.0.0~rc5-1~bpo9+1 [13:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:32] godog: yes please i think the dpkg errors have [mostly] all cleared as well now [13:25:35] (03PS1) 10Lars Wirzenius: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524212 [13:25:37] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524212 (owner: 10Lars Wirzenius) [13:25:40] (03PS1) 10Muehlenhoff: Add DNS entries for idp1001 [dns] - 10https://gerrit.wikimedia.org/r/524213 (https://phabricator.wikimedia.org/T228403) 
[13:25:45] I see metrics coming back on https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&from=now-1h&to=now [13:26:36] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524212 (owner: 10Lars Wirzenius) [13:26:43] indeed looks like we're basically back [13:26:52] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524212 (owner: 10Lars Wirzenius) [13:27:03] jbond42: sorry, was distracted by a DNS change, let me know if I can help with anything [13:27:18] moritzm: its ok i think we are back to a normal state now [13:27:27] just checking the last few errors on icinga [13:27:33] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 73.65 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:28:25] here's the error: https://phabricator.wikimedia.org/P8773 [13:28:35] RECOVERY - DPKG on cp5006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:28:38] I'll be afk for 10 min [13:28:44] ok, if you did roll out the new mtail with debdeploy, there's also the option to use "rollback-update spec.yaml" and then deploy the new YAML file it generates [13:28:53] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [13:29:00] it essentially installs the previous version from the apt cache, then [13:29:07] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.14 [13:29:33] moritzm: ahh cool to know. 
i just tried updating the version to the downgraded one which obviously didn't work [13:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:01] !log liw@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.14 (duration: 00m 53s) [13:30:03] moritzm: ahh cool to know. i just tried updating the yaml file from the deploy with the jessie version to downgrade, which obviously didn't work [13:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:37] ack [13:32:14] ema: thanks for the output i have added it to https://phabricator.wikimedia.org/T225604 [13:33:08] jbond42: I don't think this qualifies as a site outage so you're not gonna win any tshirts sadly :) [13:33:56] lol ahh shucks, that reminds me though i should at least have my bag by now :( [13:34:51] !log remove mtail 3.0.0~rc24.1-1+wmf1 from stretch-wikimedia [13:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:11] (03CR) 10Ottomata: "Yeah, I see both sides. I think for now I'd opt on having control over the URL via config rather than code. This will allow us to more e" [puppet] - 10https://gerrit.wikimedia.org/r/523792 (https://phabricator.wikimedia.org/T226522) (owner: 10Ottomata) [13:36:21] (03PS4) 10Ottomata: Add change-prop event_service_uri and point at eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/523792 (https://phabricator.wikimedia.org/T226522) [13:36:26] !log rebooting labstore1005.eqiad.wmnet - T224228 [13:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:17] (03CR) 10Ottomata: [C: 03+1] Switch RESTBase event production to eventgate. Step 1. 
[puppet] - 10https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T524055) (owner: 10Ppchelko) [13:39:13] anomie: merging [13:39:31] (03PS3) 10Filippo Giunchedi: Logstash: Use log context for the api-feature-usage channel [puppet] - 10https://gerrit.wikimedia.org/r/493323 (https://phabricator.wikimedia.org/T217162) (owner: 10Anomie) [13:39:54] (03CR) 10Filippo Giunchedi: [C: 03+2] Logstash: Use log context for the api-feature-usage channel [puppet] - 10https://gerrit.wikimedia.org/r/493323 (https://phabricator.wikimedia.org/T217162) (owner: 10Anomie) [13:43:45] (03CR) 10Hashar: [V: 03+1] "Cherry picked on the CI puppet master and I have manually cleaned all the packages." [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [13:45:04] stephanebisson, would you be able to say if "User.php: Cannot create a user with no name, no ID, and no actor ID" is worrying? [13:47:38] anomie: merged, PTAL [13:47:44] (03CR) 10Hashar: "Could not find class ::contint::packages::apt for integration-cumin.integration.eqiad.wmflabs at /etc/puppet/modules/profile/manifests/wmc" [puppet] - 10https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [13:49:00] godog: Log messages coming into logstash seem correct. Thanks. [13:50:48] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [13:50:51] anomie: np! 
glad it worked [13:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:23] (03CR) 10Hashar: [V: 03+1 C: 03+1] "Due to:" [puppet] - 10https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [13:53:21] !log installing php5 security updates [13:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:42] (03PS8) 10Hashar: releases: inline php packages installation [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) [13:54:44] (03PS4) 10Hashar: contint: remove php packages [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) [13:54:46] (03PS4) 10Hashar: contint: apply apt::unattend_upgrade at role level [puppet] - 10https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) [13:54:48] (03PS1) 10Hashar: contint: remove sqlite3 debian package [puppet] - 10https://gerrit.wikimedia.org/r/524219 (https://phabricator.wikimedia.org/T225735) [13:55:46] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10fgiunchedi) From a cursory look on both ms-fe2005 and ms-fe1005 with `tshark -NNnt -i any -Y 'tcp.analysis.retransmission'` it looks like burst of retransmissions when talking to thumbor hosts, and some... [13:57:06] (03CR) 10Hashar: [V: 03+1 C: 03+1] "Marked slite3 as being automatically installed:" [puppet] - 10https://gerrit.wikimedia.org/r/524219 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [13:58:16] (03PS4) 10Ema: Add profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/524205 (https://phabricator.wikimedia.org/T210411) [13:58:25] liw: It sounds worrying... Do you have any more context? 
[13:59:33] (03CR) 10Ema: [C: 03+2] Add profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/524205 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:00:17] (03PS1) 10Hashar: contint: no more include ::packages::javascript by default [puppet] - 10https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735) [14:04:26] stephanebisson, I'll open a task, sec [14:04:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/524213 (https://phabricator.wikimedia.org/T228403) (owner: 10Muehlenhoff) [14:05:17] (03CR) 10Hashar: [V: 03+1 C: 03+1] "cherry picked and I have garbage collected npm/nodejs from all instances with the exception of the webperformance instance." [puppet] - 10https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [14:06:51] stephanebisson, https://phabricator.wikimedia.org/T228425 [14:08:29] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10MoritzMuehlenhoff) We could try rebooting the Thumbor hosts to the kernel version with the SACK fixes, they are currently running with SACKs disabled. [14:08:53] (03PS1) 10Hashar: contint: no more include ::contint::packages::ruby by default [puppet] - 10https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735) [14:09:25] stephanebisson, I filed it as a blocker, just in case, but I'm happy to be told by someone who understands the code that it's not a blocker [14:09:40] (03PS2) 10Gehel: wdqs: introduced tuned journal options to wdqs2003. [puppet] - 10https://gerrit.wikimedia.org/r/523873 (https://phabricator.wikimedia.org/T228122) [14:10:11] (03CR) 10Hashar: [V: 03+1 C: 03+1] "Applied :-]]]" [puppet] - 10https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [14:10:35] (03CR) 10Gehel: [C: 03+2] wdqs: introduced tuned journal options to wdqs2003. 
[puppet] - 10https://gerrit.wikimedia.org/r/523873 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel) [14:12:50] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer [14:12:54] (03CR) 10Hashar: [V: 03+1 C: 03+1] "I have marked all packages as being automatically installed (apt-mark auto xxx'. Once puppet has run on all instances, an apt-get autorem" [puppet] - 10https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [14:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:13] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entries for idp1001 [dns] - 10https://gerrit.wikimedia.org/r/524213 (https://phabricator.wikimedia.org/T228403) (owner: 10Muehlenhoff) [14:14:11] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - https://phabricator.wikimedia.org/T225988 (10Papaul) [14:16:02] jijiki: objections to rebooting thumbor2003 for T228086 ? [14:16:03] T228086: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 [14:16:51] sure, let me depool first [14:17:15] !log Depool thumbor2003 for reboot [14:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:44] go ahead godog [14:18:11] (03PS1) 10Hashar: contint: remove contint::php [puppet] - 10https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735) [14:18:22] if I can roll reboot all of them afterwards if you want [14:18:30] kk, thanks jijiki ! 
I'll upgrade the kernel too [14:18:43] if it helps then yeah for sure [14:18:56] alright, ping me when it's time [14:20:08] !log reboot thumbor2003 [14:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:42] (03PS1) 10Muehlenhoff: Add netboot.cfg for idp hosts [puppet] - 10https://gerrit.wikimedia.org/r/524226 (https://phabricator.wikimedia.org/T228403) [14:23:03] (03CR) 10Hashar: [V: 03+1 C: 03+1] "'update-alternatives --remove php /srv/deployment/integration/slave-scripts/bin/php'" [puppet] - 10https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [14:24:06] (03PS1) 10Elukey: role::analytics_cluster::hadoop::ui: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/524227 (https://phabricator.wikimedia.org/T227860) [14:24:56] !log repool thumbor2003 [14:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:58] !log nuria@deploy1001 Started deploy [analytics/refinery@4f07755]: deploying v0.0.94 of refinery [14:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:18] !log nuria@deploy1001 Finished deploy [analytics/refinery@4f07755]: deploying v0.0.94 of refinery (duration: 00m 20s) [14:27:19] (03PS1) 10Elukey: Add fake yarn.wikimedia.org's TLS key [labs/private] - 10https://gerrit.wikimedia.org/r/524228 [14:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:25] 10Operations, 10ops-codfw: (OoW) wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10Papaul) This server needs to to offline for me to perform all the troubleshooting recommended by Dell. https://www.dell.com/support/article/us/en/04/qna42558/dell-poweredge... 
[14:27:42] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake yarn.wikimedia.org's TLS key [labs/private] - 10https://gerrit.wikimedia.org/r/524228 (owner: 10Elukey) [14:27:52] (03PS2) 10Bstorm: toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) [14:28:55] !log cp hosts: apt autoremove to clean up pkgs on the fleet [14:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [14:30:14] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10akosiaris) >>! In T226236#5345617, @hashar wrote: > Thanks, I can confirm the component is around and it addresses the c... [14:31:41] (03PS3) 10Bstorm: toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) [14:31:58] jijiki: if you have the time to roll-restart the remaining thumbor codfw hosts I'd appreciate it yes, if not that's fine too (the kernel is already upgraded) [14:32:05] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17475/" [puppet] - 10https://gerrit.wikimedia.org/r/524227 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [14:32:12] stephanebisson, thanks! 
[14:33:20] (03PS1) 10Lars Wirzenius: all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524230 [14:33:22] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524230 (owner: 10Lars Wirzenius) [14:33:33] (03PS4) 10Bstorm: toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) [14:34:16] 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10akosiaris) 05Open→03Resolved a:03akosiaris I think we can resolve this, right? I am gonna be bold and resolve it, feel free to reopen if needed [14:34:18] 10Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (10akosiaris) [14:34:27] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524230 (owner: 10Lars Wirzenius) [14:34:36] (03PS4) 10Elukey: ReportUpdater: change repo of all queries to reportupdater-queries [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T222739) (owner: 10Fdans) [14:35:27] (03CR) 10Bstorm: [C: 03+2] toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [14:36:07] (03PS5) 10Ottomata: ReportUpdater: change repo of all queries to reportupdater-queries [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T222739) (owner: 10Fdans) [14:36:22] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524230 (owner: 10Lars Wirzenius) [14:36:24] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.14 [14:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:39] PROBLEM - 
DPKG on analytics-tool1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:37:39] !log all wikis at 1.34.0-wmf.14 [14:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:06] (03CR) 10Ottomata: [C: 03+2] ReportUpdater: change repo of all queries to reportupdater-queries [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T222739) (owner: 10Fdans) [14:38:23] (03PS1) 10MSantos: Re-enable eqiad crons [puppet] - 10https://gerrit.wikimedia.org/r/524232 (https://phabricator.wikimedia.org/T218097) [14:38:41] PROBLEM - Check systemd state on analytics-tool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:38:57] this is me --^ [14:39:30] (03CR) 10MSantos: "Gehel, for when we are ready." [puppet] - 10https://gerrit.wikimedia.org/r/524232 (https://phabricator.wikimedia.org/T218097) (owner: 10MSantos) [14:40:55] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-full] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:41:24] (03PS2) 10Jbond: puppetmasters: add new puppetmaster1003 to puppetmasters config [puppet] - 10https://gerrit.wikimedia.org/r/524197 (https://phabricator.wikimedia.org/T201342) [14:42:46] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10Papaul) 05Open→03Resolved No errors showing in log and all Hardware showing green, firmware is at version 2.6. we can resolve this task for now and reopen in case we have the issue again. [14:44:37] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:45:21] RECOVERY - Check systemd state on analytics-tool1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:45:40] !log gehel@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [14:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:59] RECOVERY - DPKG on analytics-tool1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:46:06] I'll go ahead and roll-restart thumbor in codfw [14:46:31] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:46:44] !log roll-restart thumbor in codfw - T228086 [14:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:00] T228086: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 [14:47:06] (database?) servers seem to be sooo slow - i post the edit, but it takes up to two minutes to see it on rc irc feed, on special:recent changes, to see it in category where it was added to, etc... [14:48:03] PROBLEM - Check the last execution of mediawiki_job_mediawiki_tor_exit_node on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_mediawiki_tor_exit_node https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:48:37] PROBLEM - Host thumbor2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:38] Danny_B which wiki? 
[14:48:58] (03CR) 10Jbond: [C: 03+2] puppetmasters: add new puppetmaster1003 to puppetmasters config [puppet] - 10https://gerrit.wikimedia.org/r/524197 (https://phabricator.wikimedia.org/T201342) (owner: 10Jbond) [14:49:38] jynus: cswikisource cswikinews [14:50:02] cswiktionary cswikiquote were faster, but still not immediate as usual though [14:50:11] RECOVERY - Host thumbor2001 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [14:51:38] 10Operations, 10ops-codfw: (OoW) wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174 (10Papaul) Log is showing a lot Correctable memory error rate exceeded for DIMM_B2. We will have to take the system down and swap the memory in DIMM B2 with a DIMM from one of the decom server onsite... [14:51:53] Danny_B: and has it been happening for some time already? [14:54:22] jynus: spotted about an hour ago and it continues (nb: those wikis are not heavily active, so it should not take ages to process the saved edit) [14:54:49] that is strange, that is not done in a job [14:55:01] recent changes is updated synchronously [14:55:10] but also there is no lag on the databases [14:55:17] something is weird [15:01:25] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10ayounsi) >>! 
In T228086#5345545, @fgiunchedi wrote: > Not sure if relevant or not, but cluster wmcs also shows elevated retransmits around the same period: Yup, T228086#5334741 [15:01:37] Danny_B: I could verify it first, but the next I tried it showed immediately [15:02:02] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10fgiunchedi) Still seeing the same retransmits in codfw after kernel upgrade and roll restart of thumbor, I'll have to stall looking at this at least for the time being though [15:02:28] Danny_B: so maybe some temporary cache or something issue [15:04:18] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2045.codfw.wmnet - https://phabricator.wikimedia.org/T228281 (10Marostegui) [15:05:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1065 - https://phabricator.wikimedia.org/T227560 (10Marostegui) [15:05:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Marostegui) [15:06:24] godog: sorry I was in a meetinh [15:06:27] meeting [15:06:43] I can do eqiad [15:07:06] jijiki: np! no worries it doesn't look like it helped at all, so we can hold off for now [15:07:12] lol [15:07:40] alright then, but we restarted all thumbor codfw, yes? 
[15:08:19] (03PS1) 10Jbond: puppet-merge: test puppet-merge works with puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/524244 [15:09:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:11:00] jynus: just now again on cswikinews [15:11:09] (03CR) 10Jbond: [C: 03+2] puppet-merge: test puppet-merge works with puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/524244 (owner: 10Jbond) [15:11:35] about 70 sec delay [15:12:17] RECOVERY - Check systemd state on maps1003 is OK: OK - running: The system is fully operational [15:12:38] 10Operations, 10Phabricator, 10Traffic: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10epriestley) > I'm not sure what the purpose of the cookie is... This cookie mostly supports CSRF protection for login attempts (`phsid` is "**PH**abricator **S**ession **ID**"), and prevents an attack... 
[15:13:22] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (10JoKalliauer) [15:14:27] jijiki: that's correct yes [15:14:35] we have many fatals on mediawiki [15:14:53] which from kibana looks like they are 'Fatal error: entire web request took longer than 200 seconds and timed out' [15:15:05] I am trying to verify that this is true [15:15:26] there is a known issue with hhvm (at least, not sure about php7.2) [15:15:41] which is that when we deploy a new version of mw, there is a large spike of those timeouts [15:15:59] (03PS1) 10Jbond: puppet-merge: remove whitespace introduced to test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/524246 [15:16:26] Danny_B: Send a bug report stating the delay of recentchanges [15:16:43] hashar: I thought the issue was load [15:16:45] that'd be T204871 [15:16:46] T204871: Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments on HHVM - https://phabricator.wikimedia.org/T204871 [15:16:55] during deployments [15:17:29] (03CR) 10Jbond: [C: 03+2] puppet-merge: remove whitespace introduced to test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/524246 (owner: 10Jbond) [15:17:37] we had a spike of web request timeout after 60 seconds, which eventually transformed into load alarms at some point [15:17:58] now looking at https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor for the last one hour, yeah that is no more a spike [15:18:13] PHP Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBTransactionError' with message 'Wikimedia\Rdbms\LBFactory::shutdown: transaction round 'MWCallableUpdate::do... 
[15:18:16] shows up as well [15:18:45] https://grafana.wikimedia.org/d/000000438/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=now-3h&to=now [15:18:47] (03CR) 10Jforrester: "You need to add to extension-list before this step so that message cache gets populated, and can't do that until it's been in prod for a w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) (owner: 10Catrope) [15:19:05] yup [15:19:14] and that DBTransactionError "might" be related [15:19:28] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:19:29] (03CR) 10Jforrester: Deploy TheWikipediaLibrary to beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) (owner: 10Catrope) [15:19:30] PROBLEM - High lag on wdqs2004 is CRITICAL: 4019 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:08] hmm [15:20:22] seems to be stuff trying to commit master changes [15:20:30] and taking too long for some reason [15:20:42] hashar: should we rollback? 
[15:20:44] ACKNOWLEDGEMENT - High lag on wdqs2004 is CRITICAL: 4082 ge 3600 Gehel catching up on lag after data reload - https://phabricator.wikimedia.org/T228122 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:21:11] filing a task [15:23:46] hashar: someone reported something that is probably related [15:23:59] slowdown or editing and it appearing on recentchanges [15:24:02] *of [15:24:08] filed as T228436 [15:24:09] T228436: web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges() - https://phabricator.wikimedia.org/T228436 [15:24:45] there is an "edit lag", but database lag is mostly 0 atm [15:24:57] probably related to transaction processing [15:30:52] hashar: I updated with what I saw [15:30:59] seen yeah [15:32:02] Yes, I think we should roll back until we have time to figure out what's going on. [15:32:36] based on feedback it may be happening on g1 too, just not so massively [15:32:55] so +1 to Krinkle at least for g2 [15:34:30] (03PS1) 10Jforrester: extension-list: Add SecureLinkFixer and TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524251 (https://phabricator.wikimedia.org/T200751) [15:34:50] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm. Those ldap settings are defined in too many/the wrong ways so I'm happy you're looking at untangling it :)" [puppet] - 10https://gerrit.wikimedia.org/r/524201 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [15:35:15] (03CR) 10Jforrester: [C: 04-2] "Not until it won't break scap." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/524251 (https://phabricator.wikimedia.org/T200751) (owner: 10Jforrester) [15:36:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/524226 (https://phabricator.wikimedia.org/T228403) (owner: 10Muehlenhoff) [15:37:51] +1 [15:38:44] RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 999.7 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:39:43] jijiki: Krinkle: liw reverting due to T228436 [15:39:44] T228436: web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges() - https://phabricator.wikimedia.org/T228436 [15:40:06] than you hashar [15:40:39] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.34.0-wmf.13 # T228436 T220739 [15:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:48] T220739: 1.34.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T220739 [15:41:44] (03PS1) 10Hashar: Revert "all wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524252 (https://phabricator.wikimedia.org/T228436) [15:41:57] (03CR) 10Hashar: [C: 03+2] Revert "all wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524252 (https://phabricator.wikimedia.org/T228436) (owner: 10Hashar) [15:42:52] (03Merged) 10jenkins-bot: Revert "all wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524252 (https://phabricator.wikimedia.org/T228436) (owner: 10Hashar) [15:43:08] (03CR) 10jenkins-bot: Revert "all wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524252 (https://phabricator.wikimedia.org/T228436) (owner: 10Hashar) [15:45:51] let's celebrate things that wen't well [15:46:11] both mediawiki measures 
and database-level measures prevented a total overload [15:46:25] which would have led to a larger-impact issue [15:46:31] *went [15:47:08] 10Puppet, 10cloud-services-team (Kanban): Help people remember to merge labs/private git - https://phabricator.wikimedia.org/T228443 (10Andrew) [15:48:30] 10Operations, 10Wikimedia-production-error (Shared Build Failure): Everything fails with unable to load the docker file - https://phabricator.wikimedia.org/T227833 (10Jdforrester-WMF) a:03Jdforrester-WMF >>! In T227833#5345256, @akosiaris wrote: > Is this still ongoing? No. > Do I understand correctly that... [15:48:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:49:04] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Wikimedia-production-error: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) [15:49:33] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Wikimedia-production-error: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) Tagging Operations, as this might be indicative of general infra issue wi... [15:49:47] 10Operations, 10Wikimedia-production-error (Shared Build Failure): Everything fails with unable to load the docker file - https://phabricator.wikimedia.org/T227833 (10akosiaris) Good to know. Sorry for misunderstanding this! 
[15:49:51] (03PS1) 10Ayounsi: Add Fastnetmon to the netinsights role [puppet] - 10https://gerrit.wikimedia.org/r/524253 (https://phabricator.wikimedia.org/T226810) [15:51:04] (03PS1) 10Bstorm: Revert "toolforge: Enable pod security policy" [puppet] - 10https://gerrit.wikimedia.org/r/524254 [15:51:45] (03PS2) 10Bstorm: Revert "toolforge: Enable pod security policy" [puppet] - 10https://gerrit.wikimedia.org/r/524254 [15:52:44] (03CR) 10Bstorm: [C: 03+2] Revert "toolforge: Enable pod security policy" [puppet] - 10https://gerrit.wikimedia.org/r/524254 (owner: 10Bstorm) [15:54:09] (03PS1) 10Elukey: role::analytics_cluster::superset: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/524255 (https://phabricator.wikimedia.org/T227860) [15:54:30] !log depool ms-fe2005 - T228196 [15:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:37] T228196: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 [15:54:51] (03PS1) 10Nuria: Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) [15:54:53] (03CR) 10jerkins-bot: [V: 04-1] role::analytics_cluster::superset: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/524255 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [15:55:23] (03PS2) 10Ayounsi: Add Fastnetmon to the netinsights role [puppet] - 10https://gerrit.wikimedia.org/r/524253 (https://phabricator.wikimedia.org/T226810) [15:55:38] (03CR) 10jerkins-bot: [V: 04-1] Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) (owner: 10Nuria) [15:56:17] (03PS2) 10Muehlenhoff: Add netboot.cfg for idp hosts [puppet] - 10https://gerrit.wikimedia.org/r/524226 (https://phabricator.wikimedia.org/T228403) [15:58:11] (03CR) 10Ayounsi: 
"https://puppet-compiler.wmflabs.org/compiler1002/17477/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/524253 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [15:58:19] (03CR) 10Muehlenhoff: [C: 03+2] Add netboot.cfg for idp hosts [puppet] - 10https://gerrit.wikimedia.org/r/524226 (https://phabricator.wikimedia.org/T228403) (owner: 10Muehlenhoff) [15:58:36] (03PS2) 10Nuria: Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) [15:59:08] (03PS2) 10Elukey: role::analytics_cluster::superset: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/524255 (https://phabricator.wikimedia.org/T227860) [15:59:35] (03CR) 10jerkins-bot: [V: 04-1] Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) (owner: 10Nuria) [16:00:04] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:58] RECOVERY - Check the last execution of mediawiki_job_mediawiki_tor_exit_node on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_mediawiki_tor_exit_node https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:00:58] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [16:02:10] (03PS2) 10Muehlenhoff: Switch profile::openldap::management to obtain the LDAP server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/524201 (https://phabricator.wikimedia.org/T46722) [16:03:19] (03CR) 10Muehlenhoff: [C: 03+2] Switch profile::openldap::management to obtain the LDAP server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/524201 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [16:05:16] !log add routinator 0.5.0 to APT [16:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:21] Landing https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/523827/ for deployment to fix another wmf.14 regression [16:06:32] !log upgrade Routinator to 0.5.0 in codfw - T220669 [16:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:40] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [16:07:32] 10Operations, 10netops, 10User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (10fgiunchedi) Starting today at `12:48` until now, lithium has seen a variety of traffic towards ports 514 and 10514 from these devices, not sure if dns for syslog is never lo... 
[16:07:33] hashar, ack [16:09:46] (03PS1) 10Elukey: role::analytics_cluster::webserver: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/524258 (https://phabricator.wikimedia.org/T227860) [16:09:48] (03PS1) 10Elukey: role::analytics_cluster::turnilo: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/524259 (https://phabricator.wikimedia.org/T227860) [16:09:53] heading out for dinner etc [16:15:21] (03PS1) 10Andrew Bogott: no-op testing patch [labs/private] - 10https://gerrit.wikimedia.org/r/524261 [16:15:34] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] no-op testing patch [labs/private] - 10https://gerrit.wikimedia.org/r/524261 (owner: 10Andrew Bogott) [16:15:54] 10Operations, 10Release Pipeline, 10Maps (Kartotherian), 10Patch-For-Review: Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10MSantos) >>! In T223275#5187350, @Mathew.onipe wrote: > We can use the main repo instead of using the package re... [16:17:35] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [16:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:26] (03PS1) 10Ottomata: Use eventgate-wikimedia image for eventgate chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/524263 (https://phabricator.wikimedia.org/T226668) [16:20:35] * Krinkle staging on mwdebug1002 [16:24:10] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.14/resources/src/mediawiki.misc-authed-ooui/special.movePage.js: e97a284dbe54 (duration: 00m 58s) [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:36] (03PS1) 10Andrew Bogott: puppet-merge: re-run with --labsprivate if there's nothing to merge in prod [puppet] - 10https://gerrit.wikimedia.org/r/524265 (https://phabricator.wikimedia.org/T228443) [16:25:38] (03PS1) 10Andrew Bogott: puppet-merge: clarify the distiction between local and remote runs [puppet] - 10https://gerrit.wikimedia.org/r/524266 [16:25:40] 
(03PS1) 10Andrew Bogott: puppet-merge: don't merge conftool after doing a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524267 [16:29:34] !log upgrade Routinator to 0.5.0 in eqiad - T220669 [16:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:42] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [16:29:54] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10jijiki) [16:30:53] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10cchen) [16:36:31] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [16:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:39] (03PS1) 10Andrew Bogott: Add monitoring for unmerged patches in the /labs/private repo [puppet] - 10https://gerrit.wikimedia.org/r/524268 (https://phabricator.wikimedia.org/T228443) [16:39:20] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) >>! From IRC: > [2019-07-17 15:02:…] nc_redis.c... [16:40:53] (03CR) 10Jeena Huneidi: Add mediawiki development chart. 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [16:41:40] (03CR) 10Andrew Bogott: "pcc diff for this patchset: https://puppet-compiler.wmflabs.org/compiler1002/17479/puppetmaster1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/524267 (owner: 10Andrew Bogott) [16:43:20] 10Operations, 10Puppet, 10Release-Engineering-Team-TODO, 10puppet-compiler, 10Release-Engineering-Team (CI & Testing services): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10greg) [16:44:40] 10Operations, 10Analytics, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Milimetric) p:05Triage→03High [16:55:06] (03CR) 10Ori.livneh: "Someone in SRE just needs to pull the trigger on this" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [16:55:40] (03PS3) 10Ottomata: Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) (owner: 10Nuria) [16:56:34] (03CR) 10jerkins-bot: [V: 04-1] Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) (owner: 10Nuria) [16:56:51] (03PS1) 10Jbond: puppetmaster: add type checking ro puppetmaster::web_frontend [puppet] - 10https://gerrit.wikimedia.org/r/524274 [16:58:11] (03CR) 10Nuria: [C: 03+1] Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) (owner: 10Nuria) [16:59:55] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - 
https://phabricator.wikimedia.org/T228303 (10Krinkle) [17:00:04] cscott, arlolra, subbu, and halfak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T1700). [17:00:18] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) Re-blocking the train. Turns out that while wmf.13 is also affected, it is... [17:00:21] no parsoid deploy today [17:00:42] (03PS1) 10Muehlenhoff: Add DHCP config for idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/524275 (https://phabricator.wikimedia.org/T228403) [17:01:16] (03CR) 10jerkins-bot: [V: 04-1] Add DHCP config for idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/524275 (https://phabricator.wikimedia.org/T228403) (owner: 10Muehlenhoff) [17:02:46] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10greg) p:05Triage→03Unbreak! 
[17:06:25] (03PS4) 10Ottomata: Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) (owner: 10Nuria) [17:06:53] (03PS2) 10Muehlenhoff: Add DHCP config for idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/524275 (https://phabricator.wikimedia.org/T228403) [17:07:25] (03CR) 10CDanis: puppet-merge: re-run with --labsprivate if there's nothing to merge in prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524265 (https://phabricator.wikimedia.org/T228443) (owner: 10Andrew Bogott) [17:07:27] (03CR) 10jerkins-bot: [V: 04-1] Add DHCP config for idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/524275 (https://phabricator.wikimedia.org/T228403) (owner: 10Muehlenhoff) [17:08:08] (03PS1) 10Ayounsi: Make Icinga alert on Grafana RPKI alerts [puppet] - 10https://gerrit.wikimedia.org/r/524277 (https://phabricator.wikimedia.org/T220669) [17:08:44] (03PS3) 10Muehlenhoff: Add DHCP config for idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/524275 (https://phabricator.wikimedia.org/T228403) [17:11:20] (03PS9) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [17:11:26] (03CR) 10CDanis: [C: 03+1] "This does introduce a subtle functional change, which is that it used to be the case if a user called puppet-merge with an explicit sha1, " [puppet] - 10https://gerrit.wikimedia.org/r/524266 (owner: 10Andrew Bogott) [17:11:33] (03CR) 10CDanis: [C: 03+1] puppet-merge: don't merge conftool after doing a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524267 (owner: 10Andrew Bogott) [17:11:59] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [17:13:11] (03PS10) 10Vgutierrez: ncredir: Use pipes 
instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [17:14:06] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [17:15:04] !log krinkle@depoy1001: Pull down https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralAuth/+/523844/ and https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralAuth/+/524276/ (no-op, not deploying) [17:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:31] (03PS11) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [17:17:26] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [17:17:34] FFS :) [17:24:22] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10elukey) I have isolated on logstash mw1345 and it seems that the nutcracker errors h... 
[17:25:43] (03PS1) 10Bstorm: toolforge: include the kubeadm_docker_service [puppet] - 10https://gerrit.wikimedia.org/r/524281 (https://phabricator.wikimedia.org/T215531) [17:25:46] Krinkle: --^ [17:26:10] (03PS12) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [17:26:13] (03CR) 10Cwhite: [C: 03+2] gemfile: bump safe_yaml to 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/523988 (owner: 10Cwhite) [17:26:16] (03PS2) 10Bstorm: toolforge: include the kubeadm_docker_service [puppet] - 10https://gerrit.wikimedia.org/r/524281 (https://phabricator.wikimedia.org/T215531) [17:26:43] (03PS3) 10Cwhite: gemfile: bump safe_yaml to 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/523988 [17:26:51] (03CR) 10Dzahn: [C: 03+1] Make Icinga alert on Grafana RPKI alerts [puppet] - 10https://gerrit.wikimedia.org/r/524277 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [17:27:01] * Krinkle deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/524279/ soon to fix redis for wmf.14 and wmf.13 [17:27:11] \o/ [17:27:51] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2039 [dns] - 10https://gerrit.wikimedia.org/r/524282 [17:27:57] (03CR) 10Bstorm: [C: 03+2] toolforge: include the kubeadm_docker_service [puppet] - 10https://gerrit.wikimedia.org/r/524281 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [17:29:54] (03PS4) 10Cwhite: gemfile: bump safe_yaml to 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/523988 [17:30:45] (03PS13) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [17:32:45] 10Operations: Host mw2250 is not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T227547 (10Dzahn) The reason it's depooled is it had a degraded RAID (T226948) i assume. 
[17:33:28] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) Approved as Connie's manager, thanks! [17:33:34] vgutierrez: Did you mean file-format support? – https://github.com/wikimedia/mediawiki-extensions-Translate/tree/master/ffs [17:33:35] :P [17:33:51] ahahah [17:33:56] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) @RStallman-legalteam can you verify that Connie has signed the appropriate NDAs? [17:36:51] (03PS14) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [17:37:28] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [17:38:43] (03PS15) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [17:39:25] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) (owner: 10Vgutierrez) [17:40:37] (03PS16) 10Vgutierrez: ncredir: Use pipes instead of files for the access_log [puppet] - 10https://gerrit.wikimedia.org/r/524185 (https://phabricator.wikimedia.org/T228382) [17:41:50] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch - https://phabricator.wikimedia.org/T224585 (10bd808) Moving to Stretch would be a good time to also rename these hosts to get rid of the "lab" qualifier. 
These hosts seem to be running: * grafana * graphite * prometheus * statsite [17:45:32] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.14/includes/libs/objectcache/RedisBagOStuff.php: 69cd8b0f49e8caf8c7398ad76a1ce3d2da4f3e6b (duration: 00m 55s) [17:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:57] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) p:05Unbreak!→03High Prod recovered. [17:49:01] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) [17:50:42] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10RStallman-legalteam) @kzimmerman this will be similar to the last case where I'll need to create one for legal bec... [17:53:15] jouncebot: next [17:53:15] In 0 hour(s) and 6 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T1800) [17:54:13] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) Emailed; thanks Rachel! 
[17:54:47] (03PS1) 10Jbond: puppetmaster: Add the abbilty to have canary beckends [puppet] - 10https://gerrit.wikimedia.org/r/524287 [17:59:20] (03PS1) 10Cwhite: hiera: canary enable varnishkafka_exporter on cp1088 [puppet] - 10https://gerrit.wikimedia.org/r/524288 (https://phabricator.wikimedia.org/T196066) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T1800). [18:00:04] RoanKattouw: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:39] My SWAT patch is canceled, sorry for forgetting to remove it [18:02:11] With that removed, there are no patches scheduled for SWAT [18:02:37] RoanKattouw: would you mind deploying to a single host ? [18:03:04] mw2250 would need the latest code but currently scap pull is broken. could you push to it? [18:11:29] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10RStallman-legalteam) Updating again as I looked into this and no special NDA is needed in this case. Fine to proce... 
[18:25:56] RECOVERY - DPKG on contint2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:26:27] (03CR) 10Andrew Bogott: "My thought wasn't so much that someone would run this script twice, as that they would run it to do the thing that they are trying to do :" [puppet] - 10https://gerrit.wikimedia.org/r/524265 (https://phabricator.wikimedia.org/T228443) (owner: 10Andrew Bogott) [18:27:14] ^ contint2001 - just fixed issue with zuul install there, logs were in -releng [18:27:54] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:28:03] that's the same thing [18:35:04] (03PS1) 10Ladsgroup: labs: Make $wmgUseEntitySourceBasedFederation similar to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524296 (https://phabricator.wikimedia.org/T226008) [18:35:23] (03PS1) 10CDanis: syncer: verify no data == no removal [software/conftool] - 10https://gerrit.wikimedia.org/r/524297 [18:35:41] (03PS2) 10Andrew Bogott: puppet-merge: clarify the distiction between local and remote runs [puppet] - 10https://gerrit.wikimedia.org/r/524266 [18:35:43] (03PS2) 10Andrew Bogott: puppet-merge: don't merge conftool after doing a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524267 [18:35:45] (03PS2) 10Andrew Bogott: puppet-merge: include a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524265 (https://phabricator.wikimedia.org/T228443) [18:35:47] (03CR) 10Ladsgroup: [C: 03+2] labs: Make $wmgUseEntitySourceBasedFederation similar to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524296 (https://phabricator.wikimedia.org/T226008) (owner: 10Ladsgroup) [18:36:35] (03CR) 10Andrew Bogott: "> However I think that this change actually makes the script slightly" [puppet] - 10https://gerrit.wikimedia.org/r/524266 (owner: 10Andrew Bogott) [18:37:25] 
(03Merged) 10jenkins-bot: labs: Make $wmgUseEntitySourceBasedFederation similar to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524296 (https://phabricator.wikimedia.org/T226008) (owner: 10Ladsgroup) [18:37:32] (03PS3) 10Andrew Bogott: puppet-merge: clarify the distiction between local and remote runs [puppet] - 10https://gerrit.wikimedia.org/r/524266 [18:38:16] ^ rebased [18:38:35] (03CR) 10jenkins-bot: labs: Make $wmgUseEntitySourceBasedFederation similar to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524296 (https://phabricator.wikimedia.org/T226008) (owner: 10Ladsgroup) [18:38:59] (03CR) 10Andrew Bogott: [C: 03+2] puppet-merge: clarify the distiction between local and remote runs [puppet] - 10https://gerrit.wikimedia.org/r/524266 (owner: 10Andrew Bogott) [18:39:38] (03PS3) 10Andrew Bogott: puppet-merge: don't merge conftool after doing a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524267 [18:40:11] (03CR) 10CDanis: [C: 03+1] puppet-merge: include a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524265 (https://phabricator.wikimedia.org/T228443) (owner: 10Andrew Bogott) [18:41:00] (03CR) 10Andrew Bogott: [C: 03+2] puppet-merge: don't merge conftool after doing a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524267 (owner: 10Andrew Bogott) [18:42:56] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10RobH) [18:45:17] !log contint2001 - had puppet failure in puppet board / dpkg issue due to unfinished zuul install which was done on contint1001 - stopped zuul and zuul-merger, apt-install zuul (was already latest version but needed to finish configure step), apt-get autoremove to remove unused packages, ran puppet. 
dpkg and puppet happy again [18:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:23] (03PS3) 10Andrew Bogott: puppet-merge: include a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524265 (https://phabricator.wikimedia.org/T228443) [18:47:05] (03PS1) 10Bstorm: toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524299 (https://phabricator.wikimedia.org/T227290) [18:47:32] (03CR) 10Andrew Bogott: [C: 03+2] puppet-merge: include a --labsprivate run [puppet] - 10https://gerrit.wikimedia.org/r/524265 (https://phabricator.wikimedia.org/T228443) (owner: 10Andrew Bogott) [18:48:37] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Dzahn) ran wmf-auto-reimage-host on it. OS is freshly installed though the first puppet run fails because it tries to run scap pull and this is currently broken (T228328) so this... [18:48:58] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Dzahn) 05Open→03Stalled [18:50:02] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10aaron) So, I've noticed that on mw1261/mw2224 as *well* as plain old mwmaint1002,... [18:54:22] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:54:26] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:54:32] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:54:34] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:54:52] PROBLEM - WDQS HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:54:54] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:54:56] PROBLEM - WDQS HTTP Port on wdqs1008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:55:16] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:55:18] Oops, downtime expired, wdqs1008 is still depooled, not to worry [18:56:25] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10RobH) [18:58:22] (03PS3) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [18:58:34] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10RobH) @fgiunchedi, Please note I've created two sub-tasks in the private space for 
quotation and ordering. I'll need you to fill out the task description for each w... [18:59:16] (03CR) 10jerkins-bot: [V: 04-1] WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 (owner: 10Effie Mouzeli) [19:01:07] (03CR) 10CDanis: [C: 03+1] Add monitoring for unmerged patches in the /labs/private repo [puppet] - 10https://gerrit.wikimedia.org/r/524268 (https://phabricator.wikimedia.org/T228443) (owner: 10Andrew Bogott) [19:01:19] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10aaron) Err, more PEBCAK . I put the * in the wrong spot... [19:01:43] (03PS2) 10Andrew Bogott: Add monitoring for unmerged patches in the /labs/private repo [puppet] - 10https://gerrit.wikimedia.org/r/524268 (https://phabricator.wikimedia.org/T228443) [19:02:45] (03PS2) 10Bstorm: toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524299 (https://phabricator.wikimedia.org/T227290) [19:02:47] (03CR) 10Andrew Bogott: [C: 03+2] Add monitoring for unmerged patches in the /labs/private repo [puppet] - 10https://gerrit.wikimedia.org/r/524268 (https://phabricator.wikimedia.org/T228443) (owner: 10Andrew Bogott) [19:04:13] (03PS3) 10Bstorm: toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524299 (https://phabricator.wikimedia.org/T227290) [19:04:33] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10mpopov) The reason for private data access is that to investigate usage of the tile service (especially by parties outside of Wikimedia and our communities, which ar... [19:05:12] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [19:07:44] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10aaron) OK, replication for SET/DELETE seems fine on mw1261/mw2224 for me and the... [19:11:28] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use eventgate-wikimedia image for eventgate chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/524263 (https://phabricator.wikimedia.org/T226668) (owner: 10Ottomata) [19:12:34] (03PS4) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [19:13:27] (03CR) 10jerkins-bot: [V: 04-1] WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 (owner: 10Effie Mouzeli) [19:14:02] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10kzimmerman) [19:14:26] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10kzimmerman) [19:15:02] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10kzimmerman) Updated ticket to reflect need for notebook1003 & notebook1004 as well [19:15:05] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore staging - update to v1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/524024 (owner: 10Eevans) [19:15:35] (03PS4) 10Eevans: sessionstore staging - update to v1.0.2 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/524024 [19:15:37] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore staging - update to v1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/524024 (owner: 10Eevans) [19:15:52] (03PS1) 10Ottomata: Use eventgate-wikimedia version for eventgate-{main,analytics} staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/524306 (https://phabricator.wikimedia.org/T226668) [19:18:01] (03PS5) 10Dzahn: nrpe: remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/524043 [19:18:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use eventgate-wikimedia version for eventgate-{main,analytics} staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/524306 (https://phabricator.wikimedia.org/T226668) (owner: 10Ottomata) [19:18:21] (03PS2) 10Ottomata: Use eventgate-wikimedia version for eventgate-{main,analytics} staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/524306 (https://phabricator.wikimedia.org/T226668) [19:18:24] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use eventgate-wikimedia version for eventgate-{main,analytics} staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/524306 (https://phabricator.wikimedia.org/T226668) (owner: 10Ottomata) [19:18:57] (03CR) 10jerkins-bot: [V: 04-1] nrpe: remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/524043 (owner: 10Dzahn) [19:20:21] !log eevans@ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . 
[19:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:29] (03PS6) 10Dzahn: nrpe: remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/524043 [19:20:44] (03CR) 10Bstorm: [C: 03+2] toolforge: Enable pod security policy [puppet] - 10https://gerrit.wikimedia.org/r/524299 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [19:21:26] (03PS4) 10Jforrester: Revoke editmyuserjsredirect from all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) (owner: 10Gergő Tisza) [19:21:28] (03CR) 10jerkins-bot: [V: 04-1] nrpe: remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/524043 (owner: 10Dzahn) [19:21:41] (03CR) 10Jforrester: [C: 03+2] Revoke editmyuserjsredirect from all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) (owner: 10Gergő Tisza) [19:21:58] !log otto@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . 
[19:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:42] (03Merged) 10jenkins-bot: Revoke editmyuserjsredirect from all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) (owner: 10Gergő Tisza) [19:22:57] (03CR) 10jenkins-bot: Revoke editmyuserjsredirect from all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) (owner: 10Gergő Tisza) [19:23:40] (03CR) 10Jbond: "looks good but a few suggestions" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524253 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [19:24:15] (03PS1) 10Bstorm: toolforge: remove class redeclaration [puppet] - 10https://gerrit.wikimedia.org/r/524310 (https://phabricator.wikimedia.org/T215531) [19:24:36] (03PS7) 10Dzahn: nrpe: remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/524043 [19:25:31] !log otto@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . 
[19:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:47] (03PS2) 10Jforrester: Enable SecureLinkFixer in beta cluster (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524085 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:26:55] (03PS5) 10Ottomata: Add change-prop event_service_uri and point at eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/523792 (https://phabricator.wikimedia.org/T226522) [19:27:20] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T207750 Revoke editmyuserjsredirect from all users (duration: 00m 54s) [19:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:30] (03CR) 10Jforrester: [C: 03+2] Enable SecureLinkFixer in beta cluster (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524085 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:28:29] (03PS5) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [19:28:53] (03Merged) 10jenkins-bot: Enable SecureLinkFixer in beta cluster (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524085 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:29:10] (03CR) 10jenkins-bot: Enable SecureLinkFixer in beta cluster (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524085 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:30:37] (03CR) 10Jforrester: [C: 03+2] Enable SecureLinkFixer in beta cluster (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524086 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:30:41] (03CR) 10Ottomata: [C: 03+2] Add change-prop event_service_uri and point at eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/523792 (https://phabricator.wikimedia.org/T226522) (owner: 10Ottomata) [19:30:53] (03PS2) 10Jforrester: Enable SecureLinkFixer in beta cluster (2/2) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/524086 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:31:04] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524086 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:31:06] (03PS4) 10Ottomata: Switch RESTBase event production to eventgate. Step 1. [puppet] - 10https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T524055) (owner: 10Ppchelko) [19:31:11] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Switch RESTBase event production to eventgate. Step 1. [puppet] - 10https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T524055) (owner: 10Ppchelko) [19:31:24] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T228374 Enable SecureLinkFixer in beta cluster (1/2) (duration: 00m 55s) [19:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:32] T228374: Deploy SecureLinkFixer to beta cluster - https://phabricator.wikimedia.org/T228374 [19:31:52] (03PS6) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [19:32:09] (03Merged) 10jenkins-bot: Enable SecureLinkFixer in beta cluster (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524086 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:32:56] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [19:33:19] (03CR) 10jenkins-bot: Enable SecureLinkFixer in beta cluster (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524086 (https://phabricator.wikimedia.org/T228374) (owner: 10Legoktm) [19:33:32] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T228374 Enable SecureLinkFixer in beta cluster (2/2) (duration: 00m 55s) [19:33:34] (03PS7) 10Effie Mouzeli: WIP 
profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [19:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:36] (03PS2) 10Bstorm: toolforge: remove class redeclaration [puppet] - 10https://gerrit.wikimedia.org/r/524310 (https://phabricator.wikimedia.org/T215531) [19:37:17] (03PS3) 10Jforrester: Remove Content Translation event logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514672 (owner: 10Petar.petkovic) [19:38:11] (03CR) 10Bstorm: [C: 03+2] toolforge: remove class redeclaration [puppet] - 10https://gerrit.wikimedia.org/r/524310 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [19:39:42] (03PS8) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [19:41:30] (03PS9) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [19:42:11] 10Puppet, 10cloud-services-team (Kanban): Help people remember to merge labs/private git - https://phabricator.wikimedia.org/T228443 (10Andrew) 05Open→03Resolved [19:42:13] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Andrew) [19:45:05] (03PS10) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [19:45:58] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for db2039 [dns] - 10https://gerrit.wikimedia.org/r/524282 (owner: 10Papaul) [19:47:02] thank you James_F! [19:47:06] (03CR) 10Dzahn: [C: 03+2] nrpe: remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/524043 (owner: 10Dzahn) [19:47:14] (03PS8) 10Dzahn: nrpe: remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/524043 [19:48:56] legoktm: Thank *you*. :-) [19:49:02] HSTS FTW and all that. 
[19:53:03] (03PS2) 10Dzahn: nrpe: add notes_url parameter to specs [puppet] - 10https://gerrit.wikimedia.org/r/521386 [19:53:52] (03CR) 10jerkins-bot: [V: 04-1] nrpe: add notes_url parameter to specs [puppet] - 10https://gerrit.wikimedia.org/r/521386 (owner: 10Dzahn) [19:55:03] (03PS12) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [19:55:09] !log eevans@ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [19:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:04] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:01:30] (03PS13) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [20:02:25] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:02:48] (03Abandoned) 10Dzahn: nrpe: add notes_url parameter to specs [puppet] - 10https://gerrit.wikimedia.org/r/521386 (owner: 10Dzahn) [20:04:34] (03PS14) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [20:05:28] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:07:10] (03PS11) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: feature flags [puppet] - 
10https://gerrit.wikimedia.org/r/523908 [20:08:23] (03PS15) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [20:09:27] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:09:45] (03PS1) 10Bstorm: cloudstore: mount dumps NFS on wcdo Cloud VPS project [puppet] - 10https://gerrit.wikimedia.org/r/524319 (https://phabricator.wikimedia.org/T228450) [20:10:05] 10Operations, 10Cassandra, 10Goal, 10Patch-For-Review, and 2 others: Handle HBA controllers in get-raid-status-hpssacli - https://phabricator.wikimedia.org/T185216 (10WDoranWMF) [20:11:08] 10Operations, 10Cassandra, 10Goal, 10Patch-For-Review, and 2 others: Handle HBA controllers in get-raid-status-hpssacli - https://phabricator.wikimedia.org/T185216 (10Eevans) [20:12:58] (03PS12) 10Effie Mouzeli: profile::mediawiki::jobrunner: Enable feature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [20:13:10] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10WDoranWMF) [20:13:57] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM: https://puppet-compiler.wmflabs.org/compiler1001/17486/mw1300.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/523908 (owner: 10Effie Mouzeli) [20:16:25] (03CR) 10Bstorm: [C: 03+2] cloudstore: mount dumps NFS on wcdo Cloud VPS project [puppet] - 10https://gerrit.wikimedia.org/r/524319 (https://phabricator.wikimedia.org/T228450) (owner: 10Bstorm) [20:19:29] (03PS16) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 
(https://phabricator.wikimedia.org/T197873) [20:20:59] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:23:28] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Pchelolo) [20:26:30] (03PS5) 10Ottomata: Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) (owner: 10Nuria) [20:26:37] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Bumping up jar version of refine and adding transform function [puppet] - 10https://gerrit.wikimedia.org/r/524256 (https://phabricator.wikimedia.org/T227484) (owner: 10Nuria) [20:26:59] (03CR) 10Dzahn: "created https://wikitech.wikimedia.org/wiki/Monitoring/dpkg" [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:30:38] (03PS17) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [20:31:53] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:32:39] (03PS18) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [20:33:56] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) 
[20:33:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (10Nuria) 05Open→03Resolved [20:34:04] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) [20:35:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (10Nuria) 05Open→03Resolved [20:35:34] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) [20:36:16] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10aaron) [20:36:31] (03CR) 10Cwhite: [C: 03+2] prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [20:36:38] (03PS9) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) [20:36:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10Nuria) 05Open→03Resolved [20:36:51] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap 
- https://phabricator.wikimedia.org/T226808 (10Nuria) [20:37:00] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10aaron) 05Open→03Resolved a:03aaron [20:40:25] (03PS1) 10TheDJ: Add decimal seek offset for videos [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/524330 [20:40:56] (03CR) 10TheDJ: "untested, as testcases don't run on mac" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/524330 (owner: 10TheDJ) [20:41:20] (03PS2) 10TheDJ: Add decimal seek offset for videos [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/524330 (https://phabricator.wikimedia.org/T228467) [20:42:08] 10Operations, 10Cassandra, 10Core Platform Team (Cassandra Operational ), 10User-Eevans: Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10WDoranWMF) [20:42:47] (03PS19) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [20:43:35] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:43:46] (03PS1) 10Jbond: puppet-merge: possible idea to add some atomic behaviour to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/524331 (https://phabricator.wikimedia.org/T221529) [20:45:41] shdubsh: ^^^ this is a start at what i was talking about in making puppet-merge more atomic. 
it still doesn't address the issue akosia.ris described but i think it could still be an improvement [20:45:41] (03PS20) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [20:45:43] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17488/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/524277 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [20:46:10] (03PS2) 10Ayounsi: Make Icinga alert on Grafana RPKI alerts [puppet] - 10https://gerrit.wikimedia.org/r/524277 (https://phabricator.wikimedia.org/T220669) [20:46:12] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:48:35] (03PS2) 10Jbond: puppet-merge: possible idea to add some atomic behavior to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/524331 (https://phabricator.wikimedia.org/T221529) [20:51:24] !log gerrit2001 - apt-get upgrade; apt-get autoremove ; puppet agent -tv [20:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:33] (03PS21) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [20:56:14] !log gerrit2001 - reboot for kernel upgrade [20:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:27] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:57:02] !log gerrit2001 - icinga downtime for 1h [20:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:59:01] (03PS1) 10Effie Mouzeli: profile::mediawiki::jobrunner: Configure php7_only flag [puppet] - 10https://gerrit.wikimedia.org/r/524336 (https://phabricator.wikimedia.org/T219148) [21:02:04] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM: https://puppet-compiler.wmflabs.org/compiler1002/17490/" [puppet] - 10https://gerrit.wikimedia.org/r/524336 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli) [21:03:04] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/Flow: T228290 Fix fatal in ChangesListFormatter::getLogTextLinks() (duration: 01m 02s) [21:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:11] T228290: Fatal on Watchlist: Call to a member function getAlphadecimal() on null - https://phabricator.wikimedia.org/T228290 [21:08:51] jouncebot: next [21:08:51] In 1 hour(s) and 51 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T2300) [21:09:48] (03CR) 10Effie Mouzeli: [V: 04-1 C: 04-1] "Produces the error "Error while evaluating a Function Call, secret(): invalid secret thumbor/ban_path_re.lst" in https://puppet-compiler.w" [puppet] - 10https://gerrit.wikimedia.org/r/512486 (owner: 10Filippo Giunchedi) [21:12:31] effie: heads-up, time to grab a coffee. we want to reboot gerrit quickly, heh [21:12:42] but if you are in the middle of it you can stop us :) [21:12:44] haha [21:13:04] I think I am done :p [21:13:10] ok :) [21:13:16] tx for the heads up though! 
[21:13:19] it seemed relatively quiet now [21:13:23] (03CR) 10Ayounsi: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524037 (owner: 10Ayounsi) [21:15:05] (03CR) 10Jbond: "looks good a question and a nitpick" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524336 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli) [21:15:49] !log gerrit (cobalt) - scheduled 1h downtime, rebooting for kernel upgrade [21:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:09] jbond42: XioNoX ^ be right back [21:17:04] thx for the heads-up, I was indeed on it [21:17:55] ACKNOWLEDGEMENT - MD RAID on cobalt is CRITICAL: connect to address 208.80.154.81 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T228478 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:17:58] 10Operations, 10ops-eqiad: Degraded RAID on cobalt - https://phabricator.wikimedia.org/T228478 (10ops-monitoring-bot) [21:18:16] wut. that is triggered by reboot [21:18:28] all services are in downtime [21:19:41] XioNoX: gerrit dashboard back for me [21:19:59] it's okay you can keep it down longer I'm doing something else now [21:20:00] :) [21:20:44] lol. it seemed faster to reboot the server than it takes to restart the service [21:20:50] yeah seems like it [21:20:57] there should be some timeout [21:21:20] we need to fix the RAID check [21:21:35] don't need tickets if the reason is "no route to host" [21:21:42] or not the same kind at least [21:22:29] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [21:22:50] 10Operations, 10serviceops, 10Core Platform Team (Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MSantos) [21:23:10] hahah https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [21:23:26] maybe there should be something on that page [21:24:11] 10Operations, 10ops-eqiad: Degraded RAID on cobalt - https://phabricator.wikimedia.org/T228478 (10Dzahn) 05Open→03Invalid it was just being rebooted and the error was "no route to host". not a real disk issue. [21:25:23] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [21:26:32] XioNoX: that's the joke :) but more seriously, already working on a change above that would remove that default value [21:26:59] notebook1003 issue is side effect of gerrit reboot [21:31:03] ACKNOWLEDGEMENT - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T226948 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [21:31:03] ACKNOWLEDGEMENT - Check the last execution of php7.2-fpm_check_restart on mw2250 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart daniel_zahn https://phabricator.wikimedia.org/T226948 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:31:03] ACKNOWLEDGEMENT - PHP7 rendering on mw2250 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. 
not found on http://10.192.0.76:9005/w/health-check.php - 380 bytes in 0.073 second response time daniel_zahn https://phabricator.wikimedia.org/T226948 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:31:03] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2250 is CRITICAL: Host mw2250 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T226948 https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist [21:31:03] ACKNOWLEDGEMENT - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[fetch_mediawiki] daniel_zahn https://phabricator.wikimedia.org/T226948 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [21:31:59] PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled [21:32:07] jouncebot: now [21:32:07] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [21:32:40] I'm going to push out some bug fixes for Striker [21:33:47] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [21:34:46] !log bd808@deploy1001 Started deploy [striker/deploy@91594df]: Fixes for deprecation warnings and editing Tool models (T228222, T228332) [21:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:56] T228222: Django deprecation warning for "{% load staticfiles %}" - https://phabricator.wikimedia.org/T228222 [21:34:56] T228332: Fatal error when adding new maintainer to a tool - https://phabricator.wikimedia.org/T228332 [21:35:47] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [21:35:59] !log bd808@deploy1001 Finished deploy [striker/deploy@91594df]: Fixes for deprecation warnings and editing Tool models (T228222, T228332) (duration: 01m 13s) [21:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:39] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [21:37:10] SMalyshev: should we depool wdqs2004 / do you know about work on it? it says it is "marked as down but pooled" [21:38:35] mutante: hmm not sure gehel was doing some work there [21:38:44] I can take look but in 15 mins [21:39:19] SMalyshev: ok, i already pinged him and left a comment on the ticket. actually the check might be a liar because confctl says ""pooled": "no"}" odd [21:39:23] mutante: Sorry, I missed your ping earlier. Do you still need me to resync mw2250? [21:39:40] (It should be as easy as ssh-ing into that box and running "scap pull") [21:40:18] RoanKattouw: yes, it would still be nice. i would usually use scap pull but https://phabricator.wikimedia.org/T228328 [21:40:21] (03PS1) 10Eevans: sessionstore (staging): remove downed Cassandra hsot from seeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/524352 [21:40:31] RoanKattouw: reported yesterday that scap pull stopped working .. [21:41:10] (03PS2) 10Eevans: sessionstore (staging): remove failed Cassandra host from seeds list [deployment-charts] - 10https://gerrit.wikimedia.org/r/524352 [21:41:43] (03CR) 10Eevans: [V: 03+2 C: 03+2] "Self-merging trivial config change." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/524352 (owner: 10Eevans) [21:42:02] Oh whoops, it's broken because of a change I made [21:42:07] (03PS1) 10MSantos: Pass use_nodejs10 to proton [puppet] - 10https://gerrit.wikimedia.org/r/524353 (https://phabricator.wikimedia.org/T217114) [21:42:29] RoanKattouw: heh:) maybe you have comments on Tyler's change then.. at the bottom of that ticket [21:42:37] I just +1ed in [21:42:38] *it [21:42:40] cool, thanks [21:42:43] !log eevans@ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [21:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:07] It seems sensible to me, and Tyler's response to Timo (in the Gerrit comments) sounds right to me [21:45:18] +1 [21:48:01] well, if you want to deploy the fix to scap pull then i dont need the deploy to mw2250. but maybe that needs to be in proper swat [21:48:12] there is no particular rush though either way [21:49:44] !log Cleaned up stale striker logs on labweb1001 and labweb1002. Logs go to journald now so log rotate is not triggered to rotate out logs from before that change. 
[21:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:01] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10kzimmerman) @fsero I believe we're good to go on the legal/management side [21:52:08] 10Operations, 10DBA, 10MediaWiki-General, 10TechCom: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10Krinkle) [21:57:58] (03PS1) 10Eevans: sessionstore (staging): restore staging port to 8081 [deployment-charts] - 10https://gerrit.wikimedia.org/r/524362 [21:58:34] (03CR) 10Eevans: [V: 03+2 C: 03+2] "Self-merging trivial change (t-shooting container startup issues)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/524362 (owner: 10Eevans) [22:00:17] mutante: well, scap is deployed via debs :/ [22:00:20] !log eevans@ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [22:00:23] via a deb* [22:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:29] PROBLEM - puppet last run on auth1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [22:00:52] 10Operations, 10ops-codfw, 10netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10Papaul) @ayounsi please pick a day and time sometimes next week (Mon - Wed) and let me know. Thanks [22:01:32] greg-g: oh.. right [22:03:34] :) [22:03:53] 10Operations, 10ops-codfw, 10netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (10Papaul) @ayounsi please pick a day and time sometimes next week (Mon - Wed) and let me know. 
Thanks [22:07:50] (03CR) 10Ayounsi: Add Fastnetmon to the netinsights role (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524253 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [22:08:21] (03PS3) 10Ayounsi: Add Fastnetmon to the netinsights role [puppet] - 10https://gerrit.wikimedia.org/r/524253 (https://phabricator.wikimedia.org/T226810) [22:09:19] (03CR) 10jerkins-bot: [V: 04-1] Add Fastnetmon to the netinsights role [puppet] - 10https://gerrit.wikimedia.org/r/524253 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [22:10:59] (03PS4) 10Ayounsi: Add Fastnetmon to the netinsights role [puppet] - 10https://gerrit.wikimedia.org/r/524253 (https://phabricator.wikimedia.org/T226810) [22:15:15] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:15:35] PROBLEM - High lag on wdqs1010 is CRITICAL: 3.383e+04 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:16:03] RECOVERY - WDQS HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [22:16:18] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [22:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:27] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1010 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:17:07] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [22:18:01] PROBLEM - Text 
HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [22:20:15] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [22:21:21] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:21:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:21:39] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:22:03] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:22:31] RECOVERY - WDQS HTTP Port on wdqs1008 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [22:23:15] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1008 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:23:16] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=wdqs1008.eqiad.wmnet [22:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:41] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:25:04] 10Operations, 10netops, 10User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (10ayounsi) Only solution I found so far on Juniper is to deactivate/activate that syslog target (tested with cr2-esams). [22:28:21] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1239 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [22:28:47] RECOVERY - puppet last run on auth1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [22:30:43] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [22:31:17] what's going on? [22:31:38] I see... OSPF alerts, 5xx, alerts, wdqs alerts? [22:35:01] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:35:01] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [22:35:41] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [22:36:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:40:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [22:41:19] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [22:41:28] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 2.92e+04 ge 3600 Gehel catching up on lag after data transfer - https://phabricator.wikimedia.org/T228122 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:44:30] (03PS1) 10Andrew Bogott: bootstrap-vz: include more packages in the buster base image [puppet] - 10https://gerrit.wikimedia.org/r/524372 [22:45:33] (03CR) 10Andrew Bogott: [C: 03+2] bootstrap-vz: include more packages in the buster base image [puppet] - 10https://gerrit.wikimedia.org/r/524372 (owner: 10Andrew Bogott) [22:47:37] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [22:48:48] (03PS1) 10Nuria: Correcting package name in transform function [puppet] - 10https://gerrit.wikimedia.org/r/524374 (https://phabricator.wikimedia.org/T227150) [22:49:07] 
(03CR) 10jerkins-bot: [V: 04-1] Correcting package name in transform function [puppet] - 10https://gerrit.wikimedia.org/r/524374 (https://phabricator.wikimedia.org/T227150) (owner: 10Nuria) [22:52:19] (03PS2) 10Nuria: Correcting package name in transform function [puppet] - 10https://gerrit.wikimedia.org/r/524374 (https://phabricator.wikimedia.org/T227150) [22:57:25] (03CR) 10Cwhite: "Appears to do the right thing: https://puppet-compiler.wmflabs.org/compiler1001/17489/" [puppet] - 10https://gerrit.wikimedia.org/r/524288 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [22:58:48] (03CR) 10BBlack: [C: 03+2] Correcting package name in transform function [puppet] - 10https://gerrit.wikimedia.org/r/524374 (https://phabricator.wikimedia.org/T227150) (owner: 10Nuria) [23:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T2300). Please do the needful. [23:00:04] Lucas_WMDE: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:19] (03PS2) 10Jforrester: modules/varnish/templates/text-frontend.inc.vcl.erb: Fix doc reference to renamed variable [puppet] - 10https://gerrit.wikimedia.org/r/514394 [23:00:38] o/ [23:00:51] (03CR) 10jerkins-bot: [V: 04-1] modules/varnish/templates/text-frontend.inc.vcl.erb: Fix doc reference to renamed variable [puppet] - 10https://gerrit.wikimedia.org/r/514394 (owner: 10Jforrester) [23:00:59] I can swat in a few minutes [23:01:21] I can also deploy my changes myself [23:01:34] they should all be production no-ops but I’ll still briefly test them on mwdebug1002 [23:02:02] .now [23:02:03] (03PS3) 10Jforrester: varnish/templates/text-frontend.inc.vcl.erb: Fix doc ref to renamed variable [puppet] - 10https://gerrit.wikimedia.org/r/514394 [23:02:05] jouncebot: now [23:02:05] For the next 0 hour(s) and 57 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190718T2300) [23:03:31] need to rebase my changes first, apparently there’s a merge conflict [23:03:44] (due to appending settings at the end of IS.php, most likely) [23:04:07] OK go for it [23:04:10] Ping me if you need me [23:04:14] (03PS3) 10Lucas Werkmeister (WMDE): Define settings for Citoid+Wikibase integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523139 (https://phabricator.wikimedia.org/T228414) [23:04:16] (03PS3) 10Lucas Werkmeister (WMDE): Set $wgWBRepoSettings['enableRefTabs'] in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523140 (https://phabricator.wikimedia.org/T228414) [23:04:18] (03PS3) 10Lucas Werkmeister (WMDE): Configure Citoid+Wikibase integration on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (https://phabricator.wikimedia.org/T228411) [23:04:33] alright, thanks [23:04:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523139 (https://phabricator.wikimedia.org/T228414) (owner: 10Lucas Werkmeister (WMDE)) [23:06:04] 
(03Merged) 10jenkins-bot: Define settings for Citoid+Wikibase integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523139 (https://phabricator.wikimedia.org/T228414) (owner: 10Lucas Werkmeister (WMDE)) [23:06:32] (03CR) 10jenkins-bot: Define settings for Citoid+Wikibase integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523139 (https://phabricator.wikimedia.org/T228414) (owner: 10Lucas Werkmeister (WMDE)) [23:06:49] testing first change on debug [23:07:30] wiki isn’t dead and I can’t test much more ^^ syncing [23:08:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523140 (https://phabricator.wikimedia.org/T228414) (owner: 10Lucas Werkmeister (WMDE)) [23:09:04] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:523139|Define settings for Citoid+Wikibase integration (T228414)]] (duration: 00m 55s) [23:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:12] T228414: Deploy Citoid Wikibase integration - https://phabricator.wikimedia.org/T228414 [23:09:23] (03Merged) 10jenkins-bot: Set $wgWBRepoSettings['enableRefTabs'] in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523140 (https://phabricator.wikimedia.org/T228414) (owner: 10Lucas Werkmeister (WMDE)) [23:09:38] (03CR) 10jenkins-bot: Set $wgWBRepoSettings['enableRefTabs'] in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523140 (https://phabricator.wikimedia.org/T228414) (owner: 10Lucas Werkmeister (WMDE)) [23:10:14] testing second change on debug [23:11:21] wiki fine, setting still false as it should be, syncing [23:13:01] (03PS1) 10Eevans: sessionstore: update to Kask v1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/524377 [23:13:09] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:523140|Set $wgWBRepoSettings[enableRefTabs] in Wikibase.php 
(T228414)]] (duration: 01m 16s) [23:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:22] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db2039 - https://phabricator.wikimedia.org/T225988 (10Papaul) 05Open→03Resolved This is complete [23:13:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (https://phabricator.wikimedia.org/T228411) (owner: 10Lucas Werkmeister (WMDE)) [23:14:11] (03CR) 10Cwhite: "This seems like a good use case for anycast. Once check_cmd is reliable, let's try it." [puppet] - 10https://gerrit.wikimedia.org/r/524037 (owner: 10Ayounsi) [23:14:45] (03Merged) 10jenkins-bot: Configure Citoid+Wikibase integration on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (https://phabricator.wikimedia.org/T228411) (owner: 10Lucas Werkmeister (WMDE)) [23:14:51] (03CR) 10Cwhite: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/524274 (owner: 10Jbond) [23:15:18] testing third change on debug [23:15:48] everything still looking fine [23:16:19] (03CR) 10jenkins-bot: Configure Citoid+Wikibase integration on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (https://phabricator.wikimedia.org/T228411) (owner: 10Lucas Werkmeister (WMDE)) [23:17:23] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:523141|Configure Citoid+Wikibase integration on Beta (production no-op) (T228411)]] (duration: 00m 54s) [23:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:30] T228411: Deploy Citoid Wikibase integration to Beta Wikidata - https://phabricator.wikimedia.org/T228411 [23:18:06] checking logstash just in case… [23:18:46] everything seems to be in order (I’ll have to wait for the Beta scap to test the feature there) [23:18:59] RoanKattouw: anything else to do in the SWAT or should I close it? 
[23:19:10] (nothing in the calendar, at least) [23:22:17] Lucas_WMDE: If nothing else in the calendar, you can call it done. :) [23:22:29] !log Evening SWAT done [23:22:34] almost wrote EU SWAT ^^ [23:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:44] and the feature is working on Beta as well, yay! [23:24:08] \o/ [23:24:15] (03PS3) 10TheDJ: Add decimal seek offset for videos [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/524330 (https://phabricator.wikimedia.org/T228467) [23:30:39] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [23:31:09] alright, good night everyone :) [23:34:40] (03PS4) 10TheDJ: Add decimal seek offset for videos [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/524330 (https://phabricator.wikimedia.org/T228467) [23:38:28] (03PS6) 10Jeena Huneidi: Add mediawiki development chart. [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) [23:39:42] (03PS7) 10Jeena Huneidi: Add mediawiki development chart. [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) [23:40:32] (03CR) 10Jeena Huneidi: "I added README.md" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [23:46:35] (03PS8) 10Jeena Huneidi: Add mediawiki development chart. [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) [23:50:46] !log built new scap version 3.11.1-1 on boron, copied to install1002, imported package with reprepro, copied from stretch to jessie and buster (T228482) [23:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:54] T228482: Deploy scap 3.11.1-1 - https://phabricator.wikimedia.org/T228482