[00:01:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_cxserver_cluster_codfw,swagger_check_mobileapps_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:11:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:16:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:17:00] (03CR) 10Bstorm: [C: 03+1] "This looks right. I wonder where we want to cut a new release in all these patches." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578408 (https://phabricator.wikimedia.org/T246689) (owner: 10BryanDavis) [00:18:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:23:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:25:36] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1531.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:33:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:38:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:45:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:47:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:49:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:54:25] * Krinkle staging on mwdebug1002 [00:55:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:57:15] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Formatters/: Ic77b2c6b33a, T247458 (duration: 01m 12s) [00:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:20] T247458: PHP Notice: Undefined index: wgKartographerLiveData - https://phabricator.wikimedia.org/T247458 [01:03:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:07:34] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1543.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:10:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:28:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:34:44] 10Operations, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Deployment services): On beta, scap can't clear opcache on some mw servers - https://phabricator.wikimedia.org/T237033 (10Krinkle) Confirmed this is still happening on every beta deploy ([latest](https://integr... [01:34:59] 10Operations, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Deployment services): Sap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10Krinkle) [01:35:04] 10Operations, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Deployment services): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10Krinkle) [01:35:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:43:00] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [01:43:02] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [01:43:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:45:24] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:47:06] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) Here is another mysterious mis-call: [Logstash single document](https://logstash.wikimedia.org/app/kiba... [01:50:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:58:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:13:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:28:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:32:28] RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:33:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_cxserver_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:43:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:48:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:49:04] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [03:03:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:10:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:18:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:25:08] (03CR) 10Jforrester: [C: 03+2] Add prod domains to beta CSP policy to allow easier gadget testing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580127 (owner: 10Brian Wolff) [03:25:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:26:01] (03Merged) 10jenkins-bot: Add prod domains to beta CSP policy to allow easier gadget testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580127 (owner: 10Brian Wolff) [03:33:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:38:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:53:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:58:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:04:02] (03CR) 10BryanDavis: "> This looks right. I wonder where we want to cut a new release in" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578408 (https://phabricator.wikimedia.org/T246689) (owner: 10BryanDavis) [04:05:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:13:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:16:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:28:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:33:34] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:43:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:53:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:58:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:03:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:08:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:18:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:23:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:32:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:35:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:36:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:43:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:03:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:08:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:21:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:26:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:35:35] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) p:05High→03Medium >>! In T247788#5974293, @brennen wrote: > May be related to T247562. Thanks - however, I don't think it i... [06:36:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:38:42] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) [06:40:22] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) MySQL config is the same: ` 12 config differences Variable pc1008 pc1007 ====================... 
[06:45:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:51:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:56:39] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) Raid configuration is the same (included the cache policy): ` --- pc1007.raid 2020-03-17 06:45:54.531009723 +0000 +++ pc1008.raid... [07:02:29] 10Operations, 10LDAP-Access-Requests: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10ArielGlenn) I don't have privileges to block the account on the wikis. CC-ing @MoritzMuehlenhoff in hopes that he'll know what we usually do in such cases. I can't find a name like he... [07:03:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:07:12] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) From what I can see on the graphs, none of the hosts reached disk or CPU saturation, but pc1008 did: {F31686354} {F31686356}... [07:10:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:20:10] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) I have also checked that all the FS are mounted with the same options, and they are. [07:23:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:23:50] (03PS1) 10Vgutierrez: ATS: Add session_ticket_number to Inbound_TLS_settings [puppet] - 10https://gerrit.wikimedia.org/r/580174 (https://phabricator.wikimedia.org/T170567) [07:30:21] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) Table fragmentation % is almost the same on pc1007 and pc1008 [07:33:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:38:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:42:29] (03PS1) 10Elukey: Test for analytics1031 [puppet] - 10https://gerrit.wikimedia.org/r/580176 [07:44:29] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) I don't see any obvious issues on the host itself or its database itself. 
Ideas: 1) Upgrade to buster + 10.4 and start testing i... [07:49:20] Operations, LDAP-Access-Requests: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (WMDE-leszek) The person was not an engineering/technical staff member, and as far as I can tell, never had a wikitech account. [07:50:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:55:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:57:11] (PS2) Elukey: Test for analytics1031 [puppet] - https://gerrit.wikimedia.org/r/580176 [07:58:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:00:13] (03PS3) 10Elukey: Test for analytics1031 [puppet] - 10https://gerrit.wikimedia.org/r/580176 [08:03:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:08:13] (03PS4) 10Elukey: Simplify hiera configuration for the Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/580176 [08:10:06] (03CR) 10Vgutierrez: "Change is a NOOP according to pcc: https://puppet-compiler.wmflabs.org/compiler1001/21452/" [puppet] - 10https://gerrit.wikimedia.org/r/580174 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [08:10:39] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list Deutschschweiz - https://phabricator.wikimedia.org/T247737 (10ArielGlenn) Could you please give a description of the mailing list? We need it for the list info page. See https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list Thanks! [08:10:45] 10Operations, 10Wikimedia-Mailing-lists: Request new mailing list for Myanmar Wikimedia Community User Group - https://phabricator.wikimedia.org/T247647 (10ArielGlenn) Could you please give a description of the mailing list? We need it for the list info page. See https://meta.wikimedia.org/wiki/Mailing_lists#C... [08:12:37] 10Operations, 10LDAP-Access-Requests: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10ArielGlenn) If they did not have a wikitech account, they would not be in LDAP. So it's just a matter of the wiki account now. [08:13:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:17:48] (03PS5) 10Elukey: Simplify hiera configuration for the Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/580176 [08:20:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:22:54] 10Operations, 10netops: mr1-esams i2c syslog flood - https://phabricator.wikimedia.org/T242097 (10ayounsi) {F31686555} Problem solved after remote hands replaced the issue. [08:23:07] 10Operations, 10netops: mr1-esams i2c syslog flood - https://phabricator.wikimedia.org/T242097 (10ayounsi) 05Open→03Resolved [08:23:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:24:00] (03PS6) 10Elukey: Simplify hiera configuration for the Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/580176 [08:24:32] (03CR) 10Elukey: [C: 03+2] "pcc looks reasonable: https://puppet-compiler.wmflabs.org/compiler1001/21453/" [puppet] - 10https://gerrit.wikimedia.org/r/580176 (owner: 10Elukey) [08:28:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:38:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:39:32] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [08:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:41] hadoop test cluster --^ [08:39:45] \o/ [08:43:46] (03PS1) 10Elukey: sre.hadoop.roll-restart-workers.py: adjust default settings [cookbooks] - 10https://gerrit.wikimedia.org/r/580183 [08:44:22] (03CR) 10Ema: [C: 03+2] cache: decrease varnish-frontend malloc cache size [puppet] - 10https://gerrit.wikimedia.org/r/579906 (https://phabricator.wikimedia.org/T185968) (owner: 10Ema) [08:45:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_mobileapps_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:49:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:49:59] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) p:05Medium→03High The increase on writes hasn't stopped: {F31686609} [08:52:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:53:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:54:50] (03CR) 10Elukey: [C: 03+2] sre.hadoop.roll-restart-workers.py: adjust default settings [cookbooks] - 10https://gerrit.wikimedia.org/r/580183 (owner: 10Elukey) [09:00:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:09:49] (03CR) 10Hashar: "Tested locally and it worked to build Zuul and its wheel. I even got a virtualenv that seems to more or less work." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [09:09:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [09:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:20] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) [09:10:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:10:38] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) [09:15:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:17:22] (03CR) 10Ema: [C: 03+2] cache: limit upload transient storage usage [puppet] - 10https://gerrit.wikimedia.org/r/579907 (https://phabricator.wikimedia.org/T185968) (owner: 10Ema) [09:21:29] (03PS1) 10Elukey: Refactor default JVM heap settings in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/580188 [09:21:46] !log cp: rolling varnish-frontend-restart to decrease memory usage and apply transient storage limits T185968 [09:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:51] T185968: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 [09:22:33] (03CR) 10Elukey: [C: 03+2] Refactor default JVM heap settings in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/580188 (owner: 10Elukey) [09:25:53] (03CR) 10Ema: [C: 03+1] ATS: Add session_ticket_number to Inbound_TLS_settings [puppet] - 10https://gerrit.wikimedia.org/r/580174 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [09:27:00] (03PS2) 10Alexandros Kosiaris: Showcase redis pass population for changeprop [labs/private] - 10https://gerrit.wikimedia.org/r/574713 (https://phabricator.wikimedia.org/T213193) [09:27:04] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Showcase redis pass population for changeprop [labs/private] - 10https://gerrit.wikimedia.org/r/574713 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) 
[09:27:30] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [09:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:40] test again --^ [09:32:08] (03PS7) 10Ema: atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) [09:32:10] (03PS5) 10Ema: cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) [09:32:12] (03PS5) 10Ema: cache: test atskafka webrequest on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497) [09:47:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:48:57] (03PS1) 10Ladsgroup: Set up read new term store up to Q60M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580285 (https://phabricator.wikimedia.org/T219123) [09:51:29] Quickly deploying this ^ [09:51:50] (03CR) 10Ladsgroup: [C: 03+2] Set up read new term store up to Q60M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580285 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [09:52:08] ACKNOWLEDGEMENT - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 3 days ago: Most recent backup 2020-03-13 23:42:48 Jcrespo ongoing right now after failed once - The acknowledgement expires at: 2020-03-18 09:51:20. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:52:44] (03Merged) 10jenkins-bot: Set up read new term store up to Q60M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580285 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [09:54:19] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q60M (T219123)]] (duration: 01m 09s) [09:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:24] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [09:55:53] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q60M (T219123)]], take II (duration: 01m 05s) [09:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [09:57:54] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:41] !log create kafka topic atskafka_test_webrequest_text T247497 [10:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:50] T247497: Test atskafka deployment - https://phabricator.wikimedia.org/T247497 [10:02:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:03:03] !log warming up cache for Q60M to Q70M for new term store on db1111, db1126, db1104, db1092 (T219123) [10:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:07] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [10:03:07] (03PS1) 10Giuseppe Lavagetto: Switch eventgate-analytics to go through envoy everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580286 (https://phabricator.wikimedia.org/T247484) [10:04:29] Amir1: woo [10:04:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:04:51] addshore: yup, I think we can start stop writing to it today [10:05:47] Just keep an eye on https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation as it would be good to have a stable load there before swithcing off writing so we know we don't have to roll anyhting baclk [10:07:47] (03PS1) 10Vgutierrez: ATS: Consider TLSv1.3 on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) [10:07:49] <_joe_> !log sudo cumin -b2 -s 50 'A:mw-jobrunner' 'restart-php7.2-fpm' T247622 [10:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:54] T247622: RunSingleJob.php timeout too low at 180 seconds - https://phabricator.wikimedia.org/T247622 [10:08:03] (03CR) 10Ema: [C: 03+2] atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [10:08:12] addshore: REgarding deadlocks, I get it for new pages: 
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.03.17/mediawiki?id=AXDn9KBGh3Uj6x1zcazQ&_g=h@44136fa [10:08:15] (03CR) 10Ema: [C: 03+2] cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [10:08:24] and this page has only one term [10:08:26] (03CR) 10Ema: [C: 03+2] cache: test atskafka webrequest on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [10:08:52] (03CR) 10Vgutierrez: [C: 03+2] ATS: Add session_ticket_number to Inbound_TLS_settings [puppet] - 10https://gerrit.wikimedia.org/r/580174 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [10:08:52] https://www.wikidata.org/wiki/Q87808739 [10:09:36] akosiaris: ok to merge your labs/private changes? [10:10:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:10:38] (03CR) 10jerkins-bot: [V: 04-1] ATS: Consider TLSv1.3 on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [10:14:09] (03PS1) 10Ammarpad: Restrict short URL management log to stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580290 (https://phabricator.wikimedia.org/T221073) [10:14:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:16:10] (03PS2) 10Ammarpad: Restrict short URL management log to stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580290 
(https://phabricator.wikimedia.org/T221073) [10:16:40] <_joe_> uhm what's up with fatals? [10:17:32] akosiaris: I'm merging your labs/private change, it's blocking the production merges [10:17:49] <_joe_> whoever is releasing [10:17:53] <_joe_> please strop [10:17:56] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10akosiaris) [10:18:17] <_joe_> jouncebot: next [10:18:17] In 0 hour(s) and 41 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1100) [10:18:56] <_joe_> marostegui: I see a lot of db errors coming from s8 [10:19:06] vgutierrez: oops sorry, thanks! [10:19:24] ema: your commit got fully merged now :) [10:19:53] vgutierrez: yup, seen that! ty [10:20:01] !log bounce squid on install1003 T247759 [10:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:05] T247759: eqiad squid performances issue - https://phabricator.wikimedia.org/T247759 [10:20:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch eventgate-analytics to go through envoy everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580286 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [10:21:41] (03PS2) 10Vgutierrez: ATS: Consider TLSv1.3 on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) [10:22:13] (03CR) 10jerkins-bot: [V: 04-1] ATS: Consider TLSv1.3 on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [10:23:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:23:28] (03CR) 10Vgutierrez: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [10:24:28] (03PS1) 10Alexandros Kosiaris: Add k8s dummy tokens for 3 new services. [labs/private] - 10https://gerrit.wikimedia.org/r/580294 (https://phabricator.wikimedia.org/T241230) [10:27:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:28:32] (03PS3) 10Vgutierrez: ATS: Consider TLSv1.3 on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) [10:30:13] (03PS1) 10Alexandros Kosiaris: Kubernetes: Create token stanzas for some new services [puppet] - 10https://gerrit.wikimedia.org/r/580295 (https://phabricator.wikimedia.org/T241230) [10:30:37] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) There were issues probably related to T247562 (as @brennen pointed out) throughout the day including the times we had the connec... 
[10:30:54] (03PS1) 10Filippo Giunchedi: squid3: bump max open file descriptors [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) [10:31:49] (03CR) 10jerkins-bot: [V: 04-1] squid3: bump max open file descriptors [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [10:32:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:34:33] not sure if correlates, but since your deploy Amir1, there is high errors and fatals [10:34:48] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=1584427531438&to=1584441187310 [10:35:05] the baseline increased [10:35:49] (03CR) 10Ema: ATS: Consider TLSv1.3 on tls.lua (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [10:35:57] "A data update callback triggered an exception (A database query error has occurred. 
Did you" [10:36:06] "Commit failed on server(s) XXX" [10:36:23] this is the same error I told you yesterday about [10:36:41] "Cannot execute query from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate while tran" [10:36:53] that points to a logical error on the way transactions are happening [10:36:58] (write ones) [10:37:11] ^CC _joe_ that asked before about the errors [10:37:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:39:26] (03PS1) 10Jcrespo: Revert "Set up read new term store up to Q40M" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580297 [10:39:41] (03CR) 10jerkins-bot: [V: 04-1] Revert "Set up read new term store up to Q40M" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580297 (owner: 10Jcrespo) [10:40:48] !log sec update for libgraphicsmagick on maps [10:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:58] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [10:40:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [10:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:22] (03PS2) 10Filippo Giunchedi: squid3: bump max open file descriptors [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) [10:41:52] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [10:41:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [10:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:08] (03PS2) 
10Jcrespo: Revert "Set up read new term store up to Q40M" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580297 [10:42:18] (03CR) 10jerkins-bot: [V: 04-1] squid3: bump max open file descriptors [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [10:43:03] Amir1: addshore https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/580297 [10:43:09] (03CR) 10Filippo Giunchedi: "CI is failing on running tests because the system doesn't support systemd?" [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [10:43:45] if some puppet/rspec wizards want to look at ^ I really don't feel like going the rabbit hole [10:44:02] certainly not now that all I'm trying to do is actually fix stuff [10:44:35] godog: just in the middle of something now but can take a look in a bit [10:44:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:45:00] jbond42: thanks, appreciate it [10:45:14] np [10:45:27] hashar: a deploy is breaking mw, and the deployers are not responding- should I revert? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/580297 [10:46:01] *goes to laptop* [10:46:24] * addshore reads up [10:47:23] going from 60 to 35 million might have unforseen issues (as the cache of wb_terms would be cold now) due to reads not happening there in a whole [10:47:24] *while [10:47:37] if a rollback is needed then it would be better to just roll back the last 1 config change [10:47:44] Amir1: ^^ [10:47:44] can we revert to not have fatals? 
[10:48:03] * addshore opens logstash [10:48:20] top 3 are wikidata writes [10:48:40] jynus: there are not related to my patch [10:48:41] the others are longterm background issues (long requests) [10:48:49] it's people creating items [10:48:54] Amir1: but it got worse since the last deploy [10:49:00] (which is related to my yesterday's patch) [10:49:09] (03PS1) 10Jbond: debdeplot: add libGraphicsMagick-Q16 as a lib for graphicsmagick [puppet] - 10https://gerrit.wikimedia.org/r/580298 [10:49:13] Amir1: see: https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=1584427531438&to=1584441187310 [10:49:59] jynus: compare it to this: https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=now-3h&to=now&fullscreen&panelId=10 [10:50:03] Amir1: why are creations causing deadlocks? do you have an example? [10:50:27] addshore: I sent one to you half an hour ago here, let me dig it up [10:50:31] addshore: every entry on logstash is a deadlock almost right now [10:50:50] addshore: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.03.17/mediawiki?id=AXDn9KBGh3Uj6x1zcazQ&_g=h@44136fa [10:50:56] sorry, I'm gonna end up being ill from now for the coming weeks [10:51:03] lol [10:51:05] nope [10:51:18] 10Operations, 10Patch-For-Review: eqiad squid performances issue - https://phabricator.wikimedia.org/T247759 (10CDanis) [10:51:40] Amir1: which item is that for? 
[10:52:04] addshore: here you have 1400 deadlocks: https://logstash.wikimedia.org/goto/eb8cafd6e6209daca626f1807368331d :-D [10:52:08] (03CR) 10Addshore: [C: 04-1] "going from 60 to 35 million might have unforseen issues (as the cache of wb_terms would be cold now) due to reads not happening there in a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580297 (owner: 10Jcrespo) [10:52:28] addshore: Q87808739 [10:52:34] one label only [10:52:39] * godog brb [10:52:41] (03Abandoned) 10Jcrespo: Revert "Set up read new term store up to Q40M" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580297 (owner: 10Jcrespo) [10:53:11] Amir1: sounds like the item creation code path is doing something odd we need to investigate [10:53:23] I feel it's several items stepping on each other's toes [10:53:45] (03PS1) 10Elukey: Reduce hiera overrides for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/580300 [10:53:58] the thing is, right now it is not a huge concern (except the ongoing mw fatals alarm here) [10:54:11] but I worry it may end up worse as load increases [10:54:18] Amir1: is Q87808739 for the log message you sent me? or for a different one? [10:54:34] yup, it's the same [10:54:35] if it is localized, meaning the impact is low, that is good news [10:54:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:56:04] so it's only on new items being created. The load would reduce if we stop writing to the old term store (currently, everything is on write both mode) but I can't say if it go to zero once we are done with migration [10:56:34] Amir1: but what on earth is "INSERT INTO `wbt_item_terms` (wbit_item_id,wbit_term_in_lang_id) VALUES (87808739,'439496236')" deadlocking with? 
[10:56:39] my philosophy was going to be revert to solve the increase in fatals [10:56:41] !log add extra prepend to LG export filter [10:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:52] fix the writing pattern, with time [10:56:56] then continue with the deploy [10:57:18] well, changing the read value shouldn't affect how often this deadlock happens [10:57:29] for that we would have to decrease the amount of writes to the new store instead [10:57:31] addshore: note if it is a long transaction, php may be reporting on the wrong write [10:57:40] it could be a previous write [10:57:53] I can check the master what is the "most recent deadlock" [10:57:56] with extra details [10:58:22] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/21458/" [puppet] - 10https://gerrit.wikimedia.org/r/580300 (owner: 10Elukey) [10:58:32] jynus: aaah okay! [10:58:32] jynus: let us investigate and see if we can spot anything obvious, if not, I revert the write patch (not he read ones) [10:58:39] sure [10:58:45] jynus: yup, that would be great [10:58:49] I was only going to revert if I couldn't get ahead of you [10:58:56] no rush if someone is looking at it [10:59:01] (03CR) 10Elukey: [C: 03+2] Reduce hiera overrides for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/580300 (owner: 10Elukey) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
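Note on the deadlock discussed above: a generic Python sketch of how two write transactions can end up blocking each other, and why the statement named in the error may be only the last step of a longer transaction. The transaction names, lock names, and both helper functions are hypothetical. Nothing here reflects the locks the database actually took on `wbt_item_terms`, which the log does not show.

```python
from typing import Dict, List, Optional, Set


def build_waits_for(held: Dict[str, Set[str]], wanted: Dict[str, str]) -> Dict[str, str]:
    """Map each blocked transaction to the transaction holding the lock it wants."""
    owner = {lock: trx for trx, locks in held.items() for lock in locks}
    return {
        trx: owner[lock]
        for trx, lock in wanted.items()
        if lock in owner and owner[lock] != trx
    }


def find_deadlock_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the transactions forming a cycle in the wait-for graph, or None."""
    for start in waits_for:
        seen = [start]
        current = waits_for.get(start)
        while current is not None and current not in seen:
            seen.append(current)
            current = waits_for.get(current)
        if current == start:
            return seen
    return None


# Hypothetical state: each transaction already holds a lock from an EARLIER
# statement, and its current statement needs the lock the other one holds.
held = {"trx_1": {"lock_x"}, "trx_2": {"lock_y"}}
wanted = {"trx_1": "lock_y", "trx_2": "lock_x"}

cycle = find_deadlock_cycle(build_waits_for(held, wanted))
print(cycle)
```

In this toy model the error surfaces on whichever statement was waiting when the cycle closed, while the lock that completes the cycle was taken by an earlier statement in the same transaction. That is one way the reported write can differ from the write that matters.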
[11:00:09] addshore: https://phabricator.wikimedia.org/P10709 [11:01:38] addshore: I just edited because my initial paste was cut [11:01:54] INSERT /* Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStoreWriter::acquireAndInsertTerms */ INTO `wbt_item_terms` (wbit_item_id,wbit_term_in_lang_id) VALUES (87812490,'439507950') [11:02:08] was deadlocking with [11:02:10] INSERT /* Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStoreWriter::acquireAndInsertTerms */ INTO `wbt_item_terms` (wbit_item_id,wbit_term_in_lang_id) VALUES (87812491,'439507951') [11:02:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:03:23] the ids change, but always the same pattern [11:03:40] jynus: thanks, I take a look, it's too new items stepping on each other's toes [11:03:48] *two [11:04:07] maybe some part of the transaction can be separated [11:04:30] I can put this at the end I think [11:04:43] We've done this before [11:04:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:06:23] (03PS3) 10Jbond: squid3: bump max open file descriptors [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [11:06:40] godog: ^^ should be fixed ping me if you want me to go through anything [11:07:34] (03CR) 10jerkins-bot: [V: 04-1] squid3: bump max open file descriptors [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [11:07:46] apparently not lol, bet its rubocop [11:08:41] (03PS4) 10Jbond: squid3: bump max open file descriptors [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [11:08:47] yep that one shold be better ^^ [11:09:30] (03PS2) 10Alexandros Kosiaris: restrouter: undeploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/573248 (https://phabricator.wikimedia.org/T242461) [11:10:24] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [11:11:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: undeploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/573248 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [11:12:03] (03Merged) 10jenkins-bot: restrouter: undeploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/573248 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [11:15:40] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' . 
[11:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:54] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'restrouter' for release 'production' . [11:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:06] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' . [11:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:32] !log T242461 undeploy restrouter. Unused service and per task to not be used after all [11:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:36] T242461: restrouter.svc.{eqiad,codfw}.wmnet in a failed state - https://phabricator.wikimedia.org/T242461 [11:17:31] (03PS2) 10Alexandros Kosiaris: restrouter: Fully remove the helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/573249 (https://phabricator.wikimedia.org/T242461) [11:17:45] jbond42: ooh thanks!! 
appreciate it :) looks good to me [11:18:10] (03PS1) 10Elukey: Remove unnecessary hiera config for analytics1034 [puppet] - 10https://gerrit.wikimedia.org/r/580301 [11:19:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Fully remove the helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/573249 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [11:19:16] (03Merged) 10jenkins-bot: restrouter: Fully remove the helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/573249 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [11:19:36] 10Operations, 10Wikimedia-Mailing-lists: Request new mailing list for Myanmar Wikimedia Community User Group - https://phabricator.wikimedia.org/T247647 (10Ninjastrikers) [11:19:43] (03PS2) 10Alexandros Kosiaris: admin: Remove calico restrouter rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/573250 (https://phabricator.wikimedia.org/T242461) [11:19:58] cdanis: FYI squid on install1003 should be fixed already (re: webpagetest) as I deployed the fix off-puppet [11:20:09] coool [11:20:14] godog: ah, ty, I'll look at the dashboard before doing anything [11:20:14] https://grafana.wikimedia.org/d/i5YA-BXWz/squid?orgId=1 [11:20:22] (the perf one) [11:20:28] does indeed look much better [11:20:32] 10Operations, 10Wikimedia-Mailing-lists: Request new mailing list for Myanmar Wikimedia Community User Group - https://phabricator.wikimedia.org/T247647 (10Ninjastrikers) I made a update to the task description. Thanks. 
[11:20:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Remove calico restrouter rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/573250 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [11:20:58] yeah no doubt, I'm surprised things weren't more broken in obvious ways [11:21:00] (03Merged) 10jenkins-bot: admin: Remove calico restrouter rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/573250 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [11:21:08] godog: was install1003 the sole showing the issue simply because it is more loaded than the others? [11:21:28] hashar: yeah that's very likely it [11:21:51] (03CR) 10Jbond: [C: 03+1] "lgtm a very minor comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [11:22:24] godog: thank you. If building adocker container through the webproxy fails again, I will ping the task. But the file limit raise seems like a good fix :] [11:22:27] 10Operations, 10Patch-For-Review: eqiad squid performances issue - https://phabricator.wikimedia.org/T247759 (10fgiunchedi) I've bumped the limits for squid on install1003 and things look good now, the permanent fix is in https://gerrit.wikimedia.org/r/580296 [11:22:52] hashar: aye, I'm quite sure now the proxy will work as expected [11:23:51] (03CR) 10Elukey: [C: 03+2] Remove unnecessary hiera config for analytics1034 [puppet] - 10https://gerrit.wikimedia.org/r/580301 (owner: 10Elukey) [11:25:29] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list Deutschschweiz - https://phabricator.wikimedia.org/T247737 (10Lantus) Description: internal communication between the volunteer to plan and fix activities, to-dos and so on. 
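Note on the file descriptor limit mentioned above: a minimal Python sketch of reading a process's own open-file limit and estimating how close it is to that limit. It uses the standard `resource` module (Unix only). The helper names and the numbers passed in are made up for illustration; the log does not state the old or new squid limits, nor how they were checked.

```python
import resource
from typing import Optional, Tuple


def current_nofile_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fd_usage_ratio(open_fds: int, soft_limit: int) -> Optional[float]:
    """Fraction of the soft limit in use, or None when the limit is unlimited."""
    # RLIM_INFINITY is platform dependent (it can be negative), so check both forms.
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return None
    return open_fds / soft_limit


soft, hard = current_nofile_limits()
print(f"soft={soft} hard={hard}")

# Hypothetical numbers: 900 descriptors open against a soft limit of 1024.
print(fd_usage_ratio(900, 1024))
```

A ratio that stays close to 1.0 under load would be consistent with a proxy running out of descriptors, though this sketch only inspects the calling process, not a running squid.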
[11:26:56] (03CR) 10Filippo Giunchedi: squid3: bump max open file descriptors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [11:35:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:38:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:40:08] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) The cronjobs that were ran from `mwmaint1002` around the times: ` Mar 16 17:50:01 mwmaint1002 CRON[164787]: (www-data) CMD (/us... [11:41:21] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) [11:44:24] 10Operations, 10Patch-For-Review: eqiad squid performances issue - https://phabricator.wikimedia.org/T247759 (10hashar) >>! In T224576#5972281, @hashar wrote: > When building a docker container on contint1001.wikimedia.org with docker-pkg, pip gets proxy timeout error when using `http://webproxy.eqiad.wmnet:80... [11:44:41] 10Operations: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10jbond) >>! In T246998#5974303, @colewhite wrote: > That CSP works well. I think cas needs to respond with an appropriate Access-Control-Allow-Origin. https://apereo.github.io/cas/5.2.x/installation/Configuration-Properties.html#... 
[11:46:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoy: check for runtime variables set for a long time [puppet] - 10https://gerrit.wikimedia.org/r/578956 (https://phabricator.wikimedia.org/T247387) (owner: 10Giuseppe Lavagetto) [11:46:30] (03CR) 10Hashar: [C: 03+1] "That fixed the issue I had while using pip through webproxy.eqiad.wmnet." [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [11:46:45] !log test pinning icinga to a subset of cpu on icinga1001 [11:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:54] <_joe_> we might see some envoy alerts pop up in the next hour, don't worry, it's expected, I'll take care of them [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1200) [12:00:50] (03CR) 10Filippo Giunchedi: [C: 03+2] squid3: bump max open file descriptors [puppet] - 10https://gerrit.wikimedia.org/r/580296 (https://phabricator.wikimedia.org/T247759) (owner: 10Filippo Giunchedi) [12:06:36] !log cdanis@cumin1001 START - Cookbook sre.network.cf [12:06:37] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [12:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:33] 10Operations, 10Patch-For-Review: eqiad squid performances issue - https://phabricator.wikimedia.org/T247759 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Fix is deployed, looking good! 
[12:07:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:15:41] 10Operations: eqiad squid performances issue - https://phabricator.wikimedia.org/T247759 (10CDanis) [12:23:43] (03PS1) 10Alexandros Kosiaris: admin: Fix some typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/580315 [12:23:45] (03PS1) 10Alexandros Kosiaris: admin: Deduplicate calico policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/580316 [12:41:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "nice job!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/580316 (owner: 10Alexandros Kosiaris) [12:48:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:57:02] (03PS4) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) [12:57:29] RECOVERY - snapshot of s4 in codfw on db1115 is OK: snapshot for s4 at codfw taken less than 3 days ago and larger than 90 GB: Last one 2020-03-17 11:33:35 from db2099.codfw.wmnet:3314 (1147 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [12:57:58] (03PS4) 10Vgutierrez: ATS: Consider TLSv1.3 on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) [13:00:04] hashar and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - European+American Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1300). [13:00:12] (03CR) 10Vgutierrez: ATS: Consider TLSv1.3 on tls.lua (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [13:02:42] (03CR) 10Jbond: "updated thanks" (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [13:05:23] (03CR) 10Ottomata: "Right but, the structure is the same? services[0] = { name: ..., conf: {...}}" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [13:07:24] (03CR) 10Ottomata: [C: 03+1] Simplify hiera configuration for the Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/580176 (owner: 10Elukey) [13:17:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Fix some typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/580315 (owner: 10Alexandros Kosiaris) [13:18:08] (03Merged) 10jenkins-bot: admin: Fix some typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/580315 (owner: 10Alexandros Kosiaris) [13:30:19] !log stop puppet and turn on debug on icinga2001 - T247538 [13:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:24] T247538: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 [13:33:29] with a bit of lateness. 
Catching up with the train [13:36:47] (03PS5) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) [13:39:27] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) @jcrespo has suggested to do a disk performance testing just in case there's some sort of performance degradation not revealed by... [13:39:29] (03PS6) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) [13:41:42] !log Branching 1.35.0-wmf.24 # T233872 [13:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:47] T233872: 1.35.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T233872 [13:42:53] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [13:43:01] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [13:44:10] 10Operations, 10Core Platform Team, 10MediaWiki-Debug-Logger, 10observability, 10Performance-Team (Radar): MWExceptionHandler reqId sometimes differs from php-wmerrors reqId - https://phabricator.wikimedia.org/T247786 (10Anomie) Within MediaWiki, the request ID is determined inside `WebRequest::getReques... 
[13:46:51] (03PS1) 10Alexandros Kosiaris: kubernetes: Export a CLUSTER parameter in .hfenv [puppet] - 10https://gerrit.wikimedia.org/r/580325 [13:48:15] (03PS2) 10Alexandros Kosiaris: kubernetes: Export a CLUSTER parameter in .hfenv [puppet] - 10https://gerrit.wikimedia.org/r/580325 [13:49:57] (03PS1) 10Vgutierrez: varnish: Consider TLSv1.3 on log_xcps_info [puppet] - 10https://gerrit.wikimedia.org/r/580326 (https://phabricator.wikimedia.org/T170567) [13:50:56] (03PS1) 10Filippo Giunchedi: base: relax interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) [13:51:47] (03PS2) 10Vgutierrez: varnish: Consider TLSv1.3 on log_xcps_info [puppet] - 10https://gerrit.wikimedia.org/r/580326 (https://phabricator.wikimedia.org/T170567) [13:52:18] hashar: let me know if you're doing the deploy. I want to deploy a fix for deadlocks ASAP [13:52:24] (03CR) 10WMDE-leszek: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572928 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [13:52:45] Amir1: I am not deploying the train. Just preparing for the next iteration [13:53:01] hashar: oh okay [13:53:06] 10Operations, 10fundraising-tech-ops, 10observability, 10Patch-For-Review: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10fgiunchedi) >>! In T247538#5975450, @gerritbot wrote: > Change 580327 had a related patch set uploaded (by Filippo Giunchedi; owner:... 
[13:53:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes: Export a CLUSTER parameter in .hfenv [puppet] - 10https://gerrit.wikimedia.org/r/580325 (owner: 10Alexandros Kosiaris) [13:53:54] (03CR) 10Herron: [C: 03+1] base: relax interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [13:54:21] (03CR) 10WMDE-leszek: "I am back from vacation so should be able to get this moving again, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572928 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [13:55:02] (03CR) 10Volans: "inline comments on the relaxation" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [13:55:04] (03CR) 10CDanis: [C: 03+1] base: relax interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [13:55:45] (03PS1) 10Alexandros Kosiaris: Use K8S_CLUSTER everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/580329 [13:56:53] (03PS1) 10Ottomata: eventstreams - add debug mode support [deployment-charts] - 10https://gerrit.wikimedia.org/r/580330 [13:57:42] (03CR) 10Ottomata: [C: 03+1] "Oh boy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/580329 (owner: 10Alexandros Kosiaris) [13:57:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Deduplicate calico policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/580316 (owner: 10Alexandros Kosiaris) [13:57:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/580316 (owner: 10Alexandros Kosiaris) [13:58:37] (03Merged) 10jenkins-bot: admin: Deduplicate calico policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/580316 (owner: 10Alexandros Kosiaris) [13:59:07] (03PS2) 10Ottomata: eventstreams - add debug mode support [deployment-charts] - 10https://gerrit.wikimedia.org/r/580330 [13:59:15] (03PS3) 10Ottomata: eventstreams - add debug mode support [deployment-charts] - 10https://gerrit.wikimedia.org/r/580330 [14:00:13] (03CR) 10Ottomata: [C: 03+2] eventstreams - add debug mode support [deployment-charts] - 10https://gerrit.wikimedia.org/r/580330 (owner: 10Ottomata) [14:00:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] Use K8S_CLUSTER everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/580329 (owner: 10Alexandros Kosiaris) [14:01:00] (03PS2) 10Alexandros Kosiaris: Use K8S_CLUSTER everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/580329 [14:01:00] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Use K8S_CLUSTER everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/580329 (owner: 10Alexandros Kosiaris) [14:03:10] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:28] it worked :-) [14:04:43] (03CR) 10Filippo Giunchedi: "IMHO useful to try at least 2-3 times here, especially because AFAIK this check determines whether the host is up or down from icinga's PO" [puppet] - 10https://gerrit.wikimedia.org/r/579329 (https://phabricator.wikimedia.org/T247538) (owner: 10Herron) [14:06:12] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[14:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:59] (03PS1) 10Ladsgroup: Set up read new term store up to Q70M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580331 (https://phabricator.wikimedia.org/T219123) [14:07:08] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [14:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:13] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:07:20] (03CR) 10Filippo Giunchedi: base: relax interval for selected checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [14:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:24] (03PS2) 10Herron: icinga: switch check_ping packet count to 2 [puppet] - 10https://gerrit.wikimedia.org/r/579329 (https://phabricator.wikimedia.org/T247538) [14:09:20] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: switch check_ping packet count to 2 [puppet] - 10https://gerrit.wikimedia.org/r/579329 (https://phabricator.wikimedia.org/T247538) (owner: 10Herron) [14:09:38] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:44] (03CR) 10Ema: [C: 03+1] base: relax interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [14:10:02] (03PS1) 10Ottomata: eventstreams - remove comment that caused bad yaml formatting [deployment-charts] - 10https://gerrit.wikimedia.org/r/580332 [14:10:24] (03PS3) 10Giuseppe Lavagetto: mediawiki::webserver: split tls proxy config out of profile [puppet] - 10https://gerrit.wikimedia.org/r/576852 [14:11:00] (03CR) 10Ottomata: [C: 03+2] eventstreams - remove comment that caused 
bad yaml formatting [deployment-charts] - 10https://gerrit.wikimedia.org/r/580332 (owner: 10Ottomata) [14:11:34] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) So from some tests, it looks like that pc1008's disk do perform worse for some reason: **Random reads pc1007 vs pc1008:** ` fio... [14:12:37] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [14:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:37] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [14:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:34] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: [[gerrit:580328|Store item terms at late as possible to avoid deadlocks (T247553 T246898)]] (duration: 01m 07s) [14:14:36] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[14:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:40] T246898: Wikibase\Repo\Content\DataUpdateAdapter::doUpdate: Commit failed on server(s) 10.64.48.172: Cannot execute query from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate while transaction status is ERROR - https://phabricator.wikimedia.org/T246898 [14:15:06] (03CR) 10Herron: "> IMHO useful to try at least 2-3 times here, especially because" [puppet] - 10https://gerrit.wikimedia.org/r/579329 (https://phabricator.wikimedia.org/T247538) (owner: 10Herron) [14:15:10] (03CR) 10Herron: [C: 03+2] icinga: switch check_ping packet count to 2 [puppet] - 10https://gerrit.wikimedia.org/r/579329 (https://phabricator.wikimedia.org/T247538) (owner: 10Herron) [14:17:12] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/21462/ this change is currently a noop and can be applied without worrying." [puppet] - 10https://gerrit.wikimedia.org/r/576852 (owner: 10Giuseppe Lavagetto) [14:17:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::webserver: split tls proxy config out of profile [puppet] - 10https://gerrit.wikimedia.org/r/576852 (owner: 10Giuseppe Lavagetto) [14:18:32] (03PS1) 10Marostegui: pc1008: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/580333 (https://phabricator.wikimedia.org/T247787) [14:20:21] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Anomie) I don't see any MediaWiki deploys in https://wikitech.wikimedia.org/wiki/Server_Admin_Log at either of these times. At 19:11 I see... [14:20:32] (03CR) 10Marostegui: [C: 03+2] pc1008: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/580333 (https://phabricator.wikimedia.org/T247787) (owner: 10Marostegui) [14:21:36] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[14:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:03] jouncebot: now [14:22:03] No deployments scheduled for the next 1 hour(s) and 37 minute(s) [14:23:15] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [14:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:59] !log Stop mysql and restart pc1008 T247787 [14:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:04] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 [14:24:42] (03CR) 10Ladsgroup: [C: 03+2] Set up read new term store up to Q70M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580331 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [14:25:58] (03Merged) 10jenkins-bot: Set up read new term store up to Q70M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580331 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [14:28:00] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q70M (T219123)]] (duration: 01m 10s) [14:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:06] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [14:29:39] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q70M (T219123)]], take II (duration: 01m 04s) [14:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:32:01] <_joe_> Amir1: ^^ not sure if related tho [14:32:10] it's related [14:32:18] not to this one though [14:33:26] it's caused by this: https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=now-3h&to=now&fullscreen&panelId=10 [14:34:04] the reason it flaps non-stop is the wdqs lag stopping bots, then wdqs catch up, the bots start the storm, deadlocks appear, and so on [14:34:21] I thought my patch would have helped but apparently not [14:34:30] (03PS1) 10Jhedden: openstack: fix nova policy yaml format [puppet] - 10https://gerrit.wikimedia.org/r/580337 [14:34:46] (03PS1) 10Alexandros Kosiaris: calico: Remove unneeded helmfile.gotmpl files [deployment-charts] - 10https://gerrit.wikimedia.org/r/580338 [14:36:04] (03PS1) 10KartikMistry: apertium-eo-ca: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-eo-ca] - 10https://gerrit.wikimedia.org/r/580339 (https://phabricator.wikimedia.org/T247585) [14:36:20] jynus: can you do another one https://phabricator.wikimedia.org/P10709 again please? 
[14:36:57] sure [14:37:55] (03PS2) 10Alexandros Kosiaris: calico: Remove unneeded helmfile.gotmpl files [deployment-charts] - 10https://gerrit.wikimedia.org/r/580338 [14:38:10] !log mediawiki/core git push 68bc9300dc:wmf/1.35.0-wmf.24 to catch up with a change that got merged while branch is being cut # T233872 [14:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:15] T233872: 1.35.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T233872 [14:38:50] Amir1: https://phabricator.wikimedia.org/P10709#61942 [14:39:27] let me see if I can show a metric of number of deadlocks per period of time [14:40:07] (03PS1) 10Ema: Load rdkafka configuration from file [software/atskafka] - 10https://gerrit.wikimedia.org/r/580340 (https://phabricator.wikimedia.org/T237993) [14:40:22] Amir1: https://grafana.wikimedia.org/d/000000273/mysql?from=1584283214036&to=1584456014037&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104&fullscreen&panelId=19 [14:40:34] jynus: I think I got the problem [14:40:46] see how spikes turned into continuous ones at approximately the time of the deploy [14:40:52] and now back to spikes [14:41:27] also it seems it started as a new thing on 16 at 11am [14:41:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:42:25] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:05] (03CR) 10Ema: [C: 03+1] varnish: Consider TLSv1.3 on log_xcps_info [puppet] - 10https://gerrit.wikimedia.org/r/580326 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [14:43:25] (03CR) 10Andrew Bogott: [C: 03+1] openstack: fix nova policy yaml format [puppet] - 10https://gerrit.wikimedia.org/r/580337 (owner: 10Jhedden) [14:44:07] !log wdqs1010 (test server) is running a data-reload cookbook (and is probably taking longer than the expected downtime) [14:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:53] (03CR) 10Jhedden: [C: 03+2] openstack: fix nova policy yaml format [puppet] - 10https://gerrit.wikimedia.org/r/580337 (owner: 10Jhedden) [14:49:23] (03PS1) 10Giuseppe Lavagetto: mw1261: switch to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/580343 [14:51:49] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Anomie) On the MediaWiki log side, I see an increase of "Async set op failed" between 17:58 and 19:08, which seems consistent with parser ca... [14:57:56] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [14:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:19] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) >>! In T247788#5975544, @Anomie wrote: > I don't see any MediaWiki deploys in https://wikitech.wikimedia.org/wiki/Server_Admin_L... 
[14:58:22] (03CR) 10Ema: [C: 03+1] "One comment, lgtm otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [15:00:18] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [15:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:56] !log scap prep 1.35.0-wmf.24 and applying security patches # T233872 [15:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:00] T233872: 1.35.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T233872 [15:11:58] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [15:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:26] (03PS1) 10Ottomata: eventstreams - only deploy debug Service in dev env [deployment-charts] - 10https://gerrit.wikimedia.org/r/580344 [15:12:47] (03PS1) 10Elukey: jupyterhub: delete users from the database automatically [puppet] - 10https://gerrit.wikimedia.org/r/580345 [15:13:24] (03CR) 10RLazarus: [C: 03+2] Add a default User-Agent. [software/httpbb] - 10https://gerrit.wikimedia.org/r/580135 (owner: 10RLazarus) [15:13:26] (03PS1) 10Alexandros Kosiaris: calico: Switch codfw to common/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/580346 [15:14:41] (03CR) 10Ottomata: [C: 03+2] eventstreams - only deploy debug Service in dev env [deployment-charts] - 10https://gerrit.wikimedia.org/r/580344 (owner: 10Ottomata) [15:15:39] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Anomie) >>! In T247788#5975617, @Marostegui wrote: > And it did perform better and connections decreased, the unknown about why all parserca... 
[15:16:36] (03PS1) 10Filippo Giunchedi: prometheus: add icinga average latency checks [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) [15:18:08] hashar: will likely do a temporary deploy of wmf.23 to group2 for debugging purposes before long. [15:19:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:19:11] brennen: yeah that sounds good [15:19:25] what I was wondering is whether we can roll it on a per-wiki basis [15:20:29] (03PS3) 10Alexandros Kosiaris: calico: Remove unneeded helmfile.gotmpl files [deployment-charts] - 10https://gerrit.wikimedia.org/r/580338 [15:20:31] (03PS2) 10Giuseppe Lavagetto: mw1261: switch to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/580343 [15:20:35] so that for example we promote a few of the remaining wikis. Hold and watch. Then promote some others [15:22:07] (03PS1) 10Hashar: Group0 to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580349 (https://phabricator.wikimedia.org/T233872) [15:22:11] (03CR) 10Hashar: [C: 04-1] "train is on hold for now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580349 (https://phabricator.wikimedia.org/T233872) (owner: 10Hashar) [15:22:12] hashar: i don't see any technical reason we couldn't, though i dunno if necessary - _joe_ have any thoughts on ^ before i just do all wikis?
[15:22:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Remove unneeded helmfile.gotmpl files [deployment-charts] - 10https://gerrit.wikimedia.org/r/580338 (owner: 10Alexandros Kosiaris) [15:22:15] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add icinga average latency checks [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [15:22:20] (03Merged) 10jenkins-bot: calico: Remove unneeded helmfile.gotmpl files [deployment-charts] - 10https://gerrit.wikimedia.org/r/580338 (owner: 10Alexandros Kosiaris) [15:22:22] (03PS1) 10Jgreen: nsca_frack.cfg.erb - merge some groups, add fran1001, clean up format [puppet] - 10https://gerrit.wikimedia.org/r/580351 [15:22:29] (03PS2) 10Alexandros Kosiaris: calico: Switch codfw to common/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/580346 [15:22:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Switch codfw to common/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/580346 (owner: 10Alexandros Kosiaris) [15:22:41] <_joe_> hashar: we could, but I don't see a good advantage in doing that [15:22:49] <_joe_> unless we want to try to switch enwiki first [15:22:52] basing that on Tim mentioning there is a specific slab on mc1030 that seems to have generated most of its traffic [15:23:03] so that might help identify the key [15:23:07] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[15:23:07] or yeah do enwiki first [15:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:31] (03CR) 10Vgutierrez: Load rdkafka configuration from file (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/580340 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:23:57] (03PS1) 10Alexandros Kosiaris: admin: Move eqiad, staging calico to common/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/580357 [15:24:01] <_joe_> hashar: I'm logged in mc1030 (but ready to switch to another server) and to run memkeys [15:24:07] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:24:29] <_joe_> jouncebot: next [15:24:29] In 0 hour(s) and 35 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1600) [15:24:34] <_joe_> heh [15:24:36] <_joe_> we got time [15:24:41] <_joe_> let's try, brennen [15:24:56] k, readying patch [15:25:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Move eqiad, staging calico to common/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/580357 (owner: 10Alexandros Kosiaris) [15:25:17] (03Merged) 10jenkins-bot: admin: Move eqiad, staging calico to common/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/580357 (owner: 10Alexandros Kosiaris) [15:25:25] _joe_: sorry, for clarity, did you want just enwiki first? [15:25:39] <_joe_> nah let's do all [15:25:42] ack [15:25:47] <_joe_> did this cause a full outage the other day? [15:26:04] just a spike in error traffic. 
[15:26:22] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) Interesting, so your theory is that pc1008 is the culprit rather than the other way around. We have made some progress on the in... [15:26:24] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580358 [15:26:26] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580358 (owner: 10Brennen Bearnes) [15:26:30] mc1030 slab 136 https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1030&var-slab=136 [15:26:32] <_joe_> ok [15:27:52] and we have a memcached logstash board at https://logstash.wikimedia.org/app/kibana#/dashboard/memcached [15:27:53] <_joe_> hashar: there is no guarantee it will be the same slab and the same server [15:28:29] ACKNOWLEDGEMENT - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel data import in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:36] ahh [15:28:40] that is annoying :/ [15:28:47] (03PS5) 10Vgutierrez: ATS: Consider TLSv1.3 on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) [15:29:15] <_joe_> hashar: it depends on how the cache key is built [15:30:40] 10Operations, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10RLazarus) Belated update: we decided to upgrade to 1.13.1 (not 1.12.3). So far it's deployed to all MW hosts in codfw, plus the MW canaries in eqiad. Monitoring for impac...
[15:30:46] hashar: argh, i think i created weirdness here by running deploy-promote [15:32:20] ;D [15:32:50] <_joe_> ok, should I go make a coffee then, and wait for your ping? [15:33:17] _joe_: yeah, coffee sounds like a good idea. i will clean this up. [15:33:39] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572928 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [15:34:07] <_joe_> brennen: ack, I'll have irc with me [15:34:08] (03CR) 10Vgutierrez: ATS: Consider TLSv1.3 on tls.lua (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [15:34:11] <_joe_> ping me when you're ready [15:34:12] (03CR) 10Andrew Bogott: [C: 04-1] "In some cases keys and values need to be quoted, so these .yaml files all need a ton of quote marks added." [puppet] - 10https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [15:34:23] (03Abandoned) 10Brennen Bearnes: all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580358 (owner: 10Brennen Bearnes) [15:34:25] (03CR) 10Andrew Bogott: [C: 04-1] "In some cases keys and values need to be quoted, so these .yaml files all need a ton of quote marks added." [puppet] - 10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [15:34:32] (03CR) 10Andrew Bogott: [C: 04-1] "In some cases keys and values need to be quoted, so these .yaml files all need a ton of quote marks added." [puppet] - 10https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [15:36:28] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[15:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:38] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [15:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:05] _joe_: ready to scap sync-wikiversions [15:40:17] <_joe_> ok, go on [15:41:08] running now. [15:41:27] (03Restored) 10Hashar: all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580358 (owner: 10Brennen Bearnes) [15:41:38] (03PS2) 10Filippo Giunchedi: prometheus: add icinga average latency checks [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) [15:41:40] <_joe_> I see a lot of calls to WANCache:v:global:SqlBlobStore-blob:viwiki:tt%3A58119674 but that's kinda usual [15:41:58] (03Abandoned) 10Hashar: all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580358 (owner: 10Brennen Bearnes) [15:43:14] brennen: I had reset --hard /srv/mediawiki-staging to drop my leftover patch to bump to 1.35.0-wmf.24 [15:43:24] so I am not sure what we ended up syncing [15:43:34] <_joe_> ahem [15:43:56] hashar: scap is still in progress, so i'm also unsure. [15:44:03] (03PS1) 10Elukey: role::analytics_cluster::superset: add kerberos settings [puppet] - 10https://gerrit.wikimedia.org/r/580359 (https://phabricator.wikimedia.org/T239903) [15:44:13] eeek [15:44:16] I guess you can abort [15:44:19] and sync again [15:44:22] !log brennen@deploy1001 sync-wikiversions aborted: All wikis to 1.35.0-wmf.23 (duration: 03m 49s) [15:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:30] else we might be pushing 1.35.0-wmf.24 to group0 [15:44:35] <_joe_> enwiki is at .22 [15:44:37] sorry bout that :-\ [15:45:21] hashar: we were not pushing .24, fyi. will start again on sync for .23 to all. 
[15:45:47] yeah but the wikiversions.json on the deployment server had group0 wikis set to .24 [15:46:01] cause my commit was still on the deployment server [15:46:59] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::superset: add kerberos settings [puppet] - 10https://gerrit.wikimedia.org/r/580359 (https://phabricator.wikimedia.org/T239903) (owner: 10Elukey) [15:47:06] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [15:47:52] hashar: made sure of state of wikiversions.json, syncing again. should be correct at this point. [15:48:15] in practice i believe the previous state would have been a no-op. [15:48:32] (03CR) 10Mforns: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/580359 (https://phabricator.wikimedia.org/T239903) (owner: 10Elukey) [15:48:42] (03CR) 10Filippo Giunchedi: [C: 03+2] base: relax interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [15:49:28] so there are errors in logstash [15:49:37] for SqlBlobStore-blob:enwiki:ttXXXXXX [15:49:40] <_joe_> are we syncing? [15:49:41] where XXX is some stuff [15:49:50] _joe_: we are syncing [15:50:14] <_joe_> ok rollback as soon as it's done [15:50:18] _joe_: ack [15:50:20] 10Operations: Is invite-wmfall@wikimedia.org a Mailman list? - https://phabricator.wikimedia.org/T247848 (10Aklapper) I'm afraid that only #Operations can find out if I interpret https://office.wikimedia.org/wiki/Topic:Uud4c3dic0n7y7t5 correctly (PS: please add tags :) [15:50:27] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[15:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:40] <_joe_> the key clogging mc1030 seems to be [15:50:54] <_joe_> enwiki:messages:en [15:50:57] <_joe_> of the right size too [15:51:06] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [15:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:42] <_joe_> yeah confirmed [15:51:47] <_joe_> it's doing 80 MB/s [15:52:25] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 9.517 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:52:46] <_joe_> brennen: abort and quick rollback I guess [15:52:48] !log brennen@deploy1001 sync-wikiversions aborted: All wikis to 1.35.0-wmf.23 (duration: 05m 16s) [15:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:57] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[15:52:58] _joe_: ack, doing [15:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:10] and there is again WANCache:t:arwiki:gadgets-definition:9:2 [15:53:26] but it is not the only key showing errors so [15:53:41] <_joe_> hashar: as I told you [15:53:48] <_joe_> the key that creates mayhem is [15:53:58] <_joe_> > enwiki:messages:en [15:54:01] <_joe_> 82k size [15:54:10] <_joe_> called thousands of times per second [15:55:24] and there are server errors on a bunch of mc servers [15:56:12] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Reverting All wikis to 1.35.0-wmf.23 [15:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:08] (03PS1) 10Ema: atskafka: rdkafka configuration support [puppet] - 10https://gerrit.wikimedia.org/r/580364 (https://phabricator.wikimedia.org/T247497) [15:57:24] <_joe_> hashar: probably similar keys [15:57:46] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Dwisehaupt) a:03Dwisehaupt [15:58:23] <_joe_> I'll comment on the task [15:58:31] from https://logstash.wikimedia.org/app/kibana#/dashboard/memcached that is random keys on various mc servers [15:59:11] well not so random maybe. 
But I get errors for e.g. WANCache:t:arwiki:gadgets-definition:9:2 , dewiki:messages:en:lock, and a bunch of WANCache:v:global:SqlBlobStore-blob:zhwiki:tt [15:59:13] <_joe_> hashar: I disagree [15:59:27] <_joe_> it's all the aforementioned backend [15:59:37] <_joe_> I can tell you as I'm looking at mcrouter logs [15:59:52] OHHH [15:59:52] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:59:55] _joe_: thanks for debugging and apologies for the less-than-great performance in rolling changes out. [16:00:04] I was looking at the top 20 hosts, but those are mediawiki servers [16:00:04] godog and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1600). [16:00:04] Krinkle: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:09] not mc ones sorry! [16:00:22] <_joe_> brennen: I think it's definitely not your fault FWIW [16:00:37] <_joe_> deploying mediawiki is way more nuanced than it should be [16:00:48] <_joe_> godog: can you pick puppet-swat [16:00:50] <_joe_> ? 
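(Editorial aside: as a rough sanity check on the figures _joe_ quotes for enwiki:messages:en, an 82k value and about 80 MB/s on mc1030, the sketch below computes the read rate those two numbers would imply. Reading "82k" as 82 KiB and "MB" as 10^6 bytes are assumptions; the log does not state the units, and the function name is hypothetical.)

```python
# Assumed units: the log only says "82k size" and "80 MB/s".
VALUE_SIZE_BYTES = 82 * 1024          # "82k" read as 82 KiB (assumption)
OBSERVED_BYTES_PER_S = 80 * 10**6     # "80 MB/s" read as 80 * 10^6 bytes/s (assumption)


def implied_reads_per_second(bytes_per_s: float, value_size: int) -> float:
    """Reads per second of a single key needed to produce the given traffic."""
    return bytes_per_s / value_size


rate = implied_reads_per_second(OBSERVED_BYTES_PER_S, VALUE_SIZE_BYTES)
print(f"about {rate:.0f} reads/s of this one key would account for the traffic")
```

(Under these assumptions the result is on the order of a thousand reads per second, which is broadly in line with the traffic being attributed to this single hot key.)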
[16:01:03] <_joe_> I wanna write this data down on the task and it's a train blocker [16:03:00] _joe_: sure, will take a look now [16:03:04] godog: if you do, my patches are https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539842/ and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556281/ and [16:03:45] (03PS2) 10Ema: Load rdkafka configuration from file [software/atskafka] - 10https://gerrit.wikimedia.org/r/580340 (https://phabricator.wikimedia.org/T237993) [16:03:46] Krinkle: ack, taking a look [16:04:17] to confirm, changes to apache config in mediawiki module do not trigger apache restarts (but reload) ? [16:04:59] are you asking or telling me [16:05:08] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: change matrix.php column "grp" to "groups" [puppet] - 10https://gerrit.wikimedia.org/r/556281 (owner: 10Krinkle) [16:05:27] (03CR) 10Ema: Load rdkafka configuration from file (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/580340 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:05:59] asking, if you/anyone remembers offhand that is [16:06:32] godog: I don't, but I asked Daniel Z the same question yesterday, and he said it does a reload. Which is why it was important not to merge randomly in the weekend/friday, but also not as big a risk as a full restart. [16:07:30] 10Operations: Is invite-wmfall@wikimedia.org a Mailman list? - https://phabricator.wikimedia.org/T247848 (10bcampbell) Thanks @Aklapper . Sorry about not adding the correct tag. [16:08:09] Krinkle: fair! [16:09:29] (03CR) 10Filippo Giunchedi: [C: 03+2] Document Apache gzip sidestepping [puppet] - 10https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [16:09:52] (03PS2) 10Ema: atskafka: rdkafka configuration support [puppet] - 10https://gerrit.wikimedia.org/r/580364 (https://phabricator.wikimedia.org/T247497) [16:10:05] Krinkle: all done! 
[16:12:18] godog: thx [16:13:03] <_joe_> Krinkle: how do I check what's the content of a memcached key? [16:13:11] <_joe_> I mean from shell.php for instance [16:13:25] _joe_: do you have the raw key string, or a MW makeKey call? [16:13:35] <_joe_> raw key string [16:13:46] <_joe_> but I want the value mediawiki sees [16:13:59] <_joe_> so gunzipped and deserialized [16:14:13] _joe_: $cache = ObjectCache::getLocalClusterInstance(); var_dump( $cache->get( '…' ) ); [16:14:28] <_joe_> Krinkle: ack thanks [16:14:37] that's the internal BagOStuff that WAN cache and local memc operations use. [16:15:42] <_joe_> yeah it did the trick [16:20:14] (03CR) 10Vgutierrez: [C: 03+1] "LGTM!" [software/atskafka] - 10https://gerrit.wikimedia.org/r/580340 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:21:10] (03CR) 10Ema: [C: 03+2] Load rdkafka configuration from file [software/atskafka] - 10https://gerrit.wikimedia.org/r/580340 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:21:15] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10wiki_willy) Hi @Marostegui - can you create a dc-ops task for the raid controller replacement? We'll have to pull some logs to send over to... [16:22:44] (03PS5) 10C. Scott Ananian: Update linter whitelist w/ parsoid11's IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 (https://phabricator.wikimedia.org/T246833) [16:24:29] _joe_: btw, I mentioned this to Effie a while back but I'm still keeping an eye on opcache corruption as well. and while I haven't seen many yet (I also haven't looked very hard yet), I do see them still. I keep a record at https://phabricator.wikimedia.org/T245183. Just FYI :/ [16:25:10] Looks like today's train is happening in the 'secondary timeslot' ? 
[16:25:15] (03CR) 10Volans: [C: 03+1] "> Patch Set 10: Code-Review+1" [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [16:25:28] so long as it is making it jump to random methods that don't do anything bad or don't have the method by that name, we're good. but it seems only one step away from a call $processCache->delete('page') accidentally calling $dbw->delete('page') and doing a bobby drop table on production. [16:25:34] (03CR) 10Vgutierrez: [C: 03+2] ATS: Consider TLSv1.3 on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/580288 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [16:26:10] opcache is rewriting random code and making methods with those arguments be called on different (nearby) objects then they are called on. [16:27:13] cscott: the train from last thursday only happened a few hours ago [16:27:36] has not really happened yet, unfortunately. [16:27:45] oh [16:27:51] cscott: wmf.23 is still blocked and currently under investigation. the wmf.24 branch has been cut but won't go to group0 until .23 is resolved. [16:28:03] this may warrant another train status mail? [16:28:07] Krinkle, brennen what i was just catching up on T233871 [16:28:07] T233871: 1.35.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T233871 [16:28:33] looks like wmf.23 is still stuck, although Ladsgroup has a patch for the wikibase issue at least. not reviewed/merged yet. 
[16:29:43] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Jclark-ctr) [16:30:40] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Jclark-ctr) Replacement Dimm has arrived [16:32:33] (03PS1) 10CRusnov: authdns-local-update: Plumb in netbox snippet dir [puppet] - 10https://gerrit.wikimedia.org/r/580371 [16:33:12] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Dzahn) @Jclark-ctr The server is depooled, you can do the replacement any time. [16:33:52] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Dzahn) p:05Triage→03Medium [16:34:29] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Jclark-ctr) Thanks taking care of now! @Dzahn [16:35:05] brennen: sorry I have got side tracked with some paper work :/ [16:35:33] <_joe_> Krinkle: https://phabricator.wikimedia.org/T247562#5976188 is what gets wmf.23 stuck FTR [16:35:43] based on the previous and attempt and on comments made to T247562, it seems we have some lead now [16:35:46] T247562: Warning: Memcached::setMulti(): failed to set key global:segment:... 
- https://phabricator.wikimedia.org/T247562 [16:36:03] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:36:04] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:36:16] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:38] (03PS3) 10Herron: ELk7: add curator job to require disktype hdd after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/579422 (https://phabricator.wikimedia.org/T247376) [16:37:04] _joe_: interesting, messagecache.php owns that key and has a lot of logic in place to prevent that from happening more or less. 
hasn't changed recently afaik [16:37:33] (03CR) 10Herron: ELk7: add curator job to require disktype hdd after 7 days (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579422 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [16:39:21] (03PS3) 10Ema: atskafka: rdkafka configuration support [puppet] - 10https://gerrit.wikimedia.org/r/580364 (https://phabricator.wikimedia.org/T247497) [16:39:53] (03PS3) 10Vgutierrez: varnish: Consider TLSv1.3 on log_xcps_info [puppet] - 10https://gerrit.wikimedia.org/r/580326 (https://phabricator.wikimedia.org/T170567) [16:41:22] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Jclark-ctr) Replaced Failed drive host booting now [16:41:30] (03CR) 10Dzahn: [C: 03+2] DHCP: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580105 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [16:41:39] (03PS4) 10Dzahn: DHCP: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580105 (https://phabricator.wikimedia.org/T247780) [16:42:07] (03PS3) 10Filippo Giunchedi: prometheus: add icinga average latency checks [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) [16:43:11] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10jcrespo) > There's an increase in slow parses, but that may be consistent with increased load due to increased parser cache misses. > How lo... [16:45:23] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Dzahn) a:05Jclark-ctr→03Dzahn Thanks @Jclark-ctr ! I could get it per SSH now. I'll take it to get it back into production, if you are done. 
[16:46:53] !log mw1280 back after long downtime due to broken RAM, added back into puppet (T240187) [16:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:57] 10Operations, 10Security Incident Response, 10Security-Team: Request SRE assistance: specs for Security team's proposed analysis server - https://phabricator.wikimedia.org/T247492 (10chasemp) [16:46:58] T240187: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 [16:47:04] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10jcrespo) On the other hand, I wonder if the increase in writes (misses, to some extent) on pc1 and pc3 after failover, which is real: https:... [16:48:00] (03CR) 10Herron: [C: 03+1] prometheus: add icinga average latency checks [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [16:50:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:51:17] Krinkle: MediaWiki:spam-blacklist has the right combination of large size and recent changes [16:51:45] those are replaced and re-cached continuously, e.g. after an edit to it. 
[16:51:54] but it's been even larger in the last, up to 214k [16:52:18] we're looking for something that happens on the next branch of MW but not the prev, including when switching back and forth reproduces the issue [16:52:32] (03PS1) 10Michael Große: Add beta configuration for Wikibase reference formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) [16:54:08] (03CR) 10jerkins-bot: [V: 04-1] Add beta configuration for Wikibase reference formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [16:54:13] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/580371 (owner: 10CRusnov) [16:55:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:56:56] !log mw1280 - scap pull - had ancient mw version due to downtime [16:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:31] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) @wiki_willy keep in mind that I haven't been able to find any logs that shows a RAID controller malfunction unfortunately, it is... [16:58:27] PROBLEM - mediawiki-installation DSH group on mw1280 is CRITICAL: Host mw1280 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:59:09] that's because puppet just added it back, fixing [16:59:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:00:04] halfak and accraze: (Dis)respected human, time to deploy Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1700). Please do the needful. [17:00:30] (03PS2) 10Michael Große: Add beta configuration for Wikibase reference formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) [17:00:33] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10wiki_willy) Hi @Marostegui - we could try RMA'ing it (tho Dell will probably give us a hard time), if all other possibilities have been exhau... [17:01:05] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10jcrespo) We should purge some old records of pc1010, though: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from... [17:01:58] (03CR) 10jerkins-bot: [V: 04-1] Add beta configuration for Wikibase reference formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [17:03:16] (03PS1) 10Ema: atskafka: convert float64 configuration values to int [software/atskafka] - 10https://gerrit.wikimedia.org/r/580376 (https://phabricator.wikimedia.org/T237993) [17:04:00] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) >>! In T247787#5976451, @wiki_willy wrote: > Hi @Marostegui - we could try RMA'ing it (tho Dell will probably give us a hard time... 
[17:10:21] !log purging some old rows on pc1010 on a screen to earn some time T247788 [17:10:25] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:28] T247788: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 [17:11:45] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579422 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [17:12:12] jynus: let's clean binlogs, pc1010 has no slaves [17:12:24] nah [17:12:32] I can make room with tables [17:12:36] there is very old records there [17:12:41] but you have to optimize it too, no? [17:12:51] I would like to conserve binlogs for replication [17:13:05] I will see, don't worry [17:13:13] will say what I do on the ticket [17:13:34] no optimize if napspace is reused [17:13:46] ok [17:13:47] also, purge process will kick in soon anyway [17:13:57] I just want to avoid alerts [17:14:05] yep, makes sense [17:14:18] (03CR) 10Vgutierrez: [C: 03+2] varnish: Consider TLSv1.3 on log_xcps_info [puppet] - 10https://gerrit.wikimedia.org/r/580326 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [17:14:24] but we can purge binlogs for like 1 week or something, and that should be fine I think [17:15:09] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:12] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Dzahn) after puppet runs host was added back in Icinga. then: CRITICAL: 944 mismatched wikiversions after a looong scap pull it is all green now https://icinga.wikimedia... 
[17:15:31] PROBLEM - Check whether ferm is active by checking the default input chain on stat1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:15:40] I will see for old ones, I am for now purging rows [17:15:53] ok! [17:16:34] as soon as the curve flattens, I will go back to playing videogames :-D [17:17:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1280.eqiad.wmnet [17:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:15] jynus: XDD [17:17:21] (03PS1) 10Mholloway: WikimediaEditorTasks: Enable Depicts counting on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580380 (https://phabricator.wikimedia.org/T247874) [17:17:37] jynus: :) [17:18:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet [17:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:38] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:04] RECOVERY - Check whether ferm is active by checking the default input chain on stat1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:19:16] purges are already kicking in: https://grafana.wikimedia.org/d/000000273/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=pc1010&var-port=9104&fullscreen&panelId=11&from=1584461942268&to=1584465542269&orgId=1 [17:19:17] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Dzahn) 05Open→03Resolved 17:18 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet [17:19:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is 
CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:19:49] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Enable Depicts counting on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580380 (https://phabricator.wikimedia.org/T247874) (owner: 10Mholloway) [17:20:41] (03CR) 10BBlack: [C: 03+1] "If we can assume the input is intended to be ASCII or UTF-8, this simplistic replacement actually works fine for all cases (because multi-" [dns] - 10https://gerrit.wikimedia.org/r/579586 (owner: 10Volans) [17:20:43] (03Merged) 10jenkins-bot: WikimediaEditorTasks: Enable Depicts counting on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580380 (https://phabricator.wikimedia.org/T247874) (owner: 10Mholloway) [17:20:45] (03PS1) 10Bartosz Dziewoński: Re-enable DiscussionTools for everyone on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580383 (https://phabricator.wikimedia.org/T247802) [17:21:18] quickly deploying this once it's there: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/580352 [17:21:27] (03PS2) 10Bartosz Dziewoński: Re-enable DiscussionTools for everyone on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580383 (https://phabricator.wikimedia.org/T247802) [17:21:30] the fatals is https://logstash.wikimedia.org/goto/c72ff9c55ff98a1c12830019eea4b411 which is a known issue [17:22:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:22:26] jynus: the fix is going in \o/ [17:22:39] bblack: just to make sure wasn't lost in the other comments in the CR, the other option suggested by chaomodus was to do .encode('ascii', errors='replace').decode('ascii'), that might include non printable chars though [17:22:44] and thanks for the review! [17:23:05] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10herron) [17:23:53] (03CR) 10BBlack: [C: 03+1] "The additional subdirectory won't create problems for gdnsd (it's not unlike the case with the existing `helpers/` subdirectory of `ops/dn" [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [17:23:56] (03PS5) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) [17:24:08] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@3f3ddcb]: Upgrade PyHive to 0.6.2 [17:24:09] my deletes can be seen at: https://grafana.wikimedia.org/d/000000273/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=pc1010&var-port=9104&from=1584464576727&to=1584465795563&fullscreen&panelId=3&orgId=1 [17:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:17] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 6 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10eprodromou) [17:24:24] doesn't seem to be affecting performance at all [17:24:51] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@3f3ddcb]: Upgrade PyHive to 0.6.2 (duration: 00m 43s) [17:24:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:31] (03CR) 10BBlack: [C: 03+1] authdns-local-update: Plumb in netbox snippet dir [puppet] - 10https://gerrit.wikimedia.org/r/580371 (owner: 10CRusnov) [17:25:50] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@266e6da]: Update mobileapps to 6370784 [17:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:58] (03CR) 10Bartosz Dziewoński: "PS5 is the same as PS3 – it restores the config for beta cluster to keep it enabled for everyone. Per T247802, we want that configuration " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:25:59] volans: yeah I read all that, but I think the high-bit argument makes re-encoding superfluous :) [17:26:50] (03CR) 10Volans: [C: 03+1] "LGTM and needs to be sync-merged with Ia9eefdeb" [puppet] - 10https://gerrit.wikimedia.org/r/580371 (owner: 10CRusnov) [17:27:12] ack, thx :D [17:27:28] James_F: when you have some time, and if our current level of panic allows it, i'd appreciate if you could deploy https://gerrit.wikimedia.org/r/580383 (the DiscussionTools beta beta beta thing) [17:28:22] Amir1: can you roll out https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/580361/ as well? [17:28:43] MatmaRex: Yeah, not right but sure. [17:28:49] (03CR) 10Herron: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579422 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [17:28:50] Krinkle: is it the unblocker for train? [17:29:21] Amir1: … from two months ago. Just want to get it our sooner, I can verify on mwdebug. 
But can also roll out later if you're short on time no worries :) [17:29:31] but not this week's regression, not urgent right now [17:29:43] okay, I try to deploy it [17:29:50] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@266e6da]: Update mobileapps to 6370784 (duration: 04m 00s) [17:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:15] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) a:03Dzahn [17:30:32] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10herron) [17:30:40] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10herron) [17:30:47] !log mobileapps deploy failed on canary, rolled back [17:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:00] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 and stat1005 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) 05Open→03Resolved a:03elukey stat1005 is back, John and Papaul switched it to port /43. 
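(Editorial aside: the re-encoding option volans mentions at 17:22:39, `.encode('ascii', errors='replace').decode('ascii')`, can be sketched as below. This is only a minimal illustration of what that expression does to a string; the helper name `to_ascii` and the sample inputs are assumptions made for this example and are not taken from the change under review.)

```python
def to_ascii(text: str) -> str:
    """Hypothetical helper: replace each non-ASCII character with '?'."""
    return text.encode('ascii', errors='replace').decode('ascii')


# A non-ASCII character is replaced by a question mark.
print(to_ascii('caf\u00e9'))

# ASCII control characters such as TAB are valid ASCII, so they pass
# through unchanged; this is the "non printable chars" caveat.
print(repr(to_ascii('tab\there')))
```

(The second call shows why this approach alone does not guarantee printable output: anything in the 0-127 range is left as it is.)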
[17:33:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add beta configuration for Wikibase reference formatting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [17:35:14] (03PS1) 10Dzahn: site/conftool: remove mw1238 through mw1243 [puppet] - 10https://gerrit.wikimedia.org/r/580384 (https://phabricator.wikimedia.org/T247780) [17:37:12] (03CR) 10Michael Große: Add beta configuration for Wikibase reference formatting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [17:38:55] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Anomie) >>! In T247788#5976390, @jcrespo wrote: > The double key write you describe was unknown to me, So a quick background: The parse has... [17:40:08] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [17:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:17] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [17:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:56] (03CR) 10RLazarus: [C: 03+1] site/conftool: remove mw1238 through mw1243 [puppet] - 10https://gerrit.wikimedia.org/r/580384 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [17:45:38] Krinkle: did you cherry pick this for wmf.24 too? 
the branch is cut [17:46:05] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: [[gerrit:580352|Do not lock rows when there's no term returned (T247553 T246898)]] (duration: 01m 07s) [17:46:07] Amir1: I think it landed before that [17:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:11] T246898: Wikibase\Repo\Content\DataUpdateAdapter::doUpdate: Commit failed on server(s) 10.64.48.172: Cannot execute query from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate while transaction status is ERROR - https://phabricator.wikimedia.org/T246898 [17:46:20] confirmed [17:47:26] Krinkle: live in mwdebug1001 [17:47:53] 10Operations, 10ops-eqiad, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10elukey) [17:49:04] 10Operations, 10ops-eqiad, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10elukey) [17:50:20] checking [17:51:55] (03PS1) 10EBernhardson: cirrus: Increase commonswiki near match weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580394 (https://phabricator.wikimedia.org/T245642) [17:52:07] !log warming up cache for Q70M to Q80M for new term store on db1111, db1126, db1104, db1092 (T219123) [17:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:13] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [17:52:20] Amir1: lgtm [17:53:19] jynus addshore: the deadlocks are gone [17:53:24] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw123[89].eqiad.wmnet [17:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:29] Krinkle: okay, syncing [17:53:47] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw124[0-3].eqiad.wmnet [17:53:58] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:57] Amir1: awesome [17:54:58] Amir1: so far so good: https://grafana.wikimedia.org/d/000000273/mysql?from=1584456883609&to=1584467683612&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104&fullscreen&panelId=19&orgId=1 [17:55:03] (03PS5) 10Mstyles: kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) [17:55:03] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [17:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:25] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@b6bff94]: Update mobileapps to 3c73ca3 [17:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:30] (03CR) 10Mstyles: kibana: refactor kibana role to kibana profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [17:56:04] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.23/languages/LanguageConverter.php: [[gerrit:580361|languages: Don't assume in LanguageConverter (T235360)]] (duration: 01m 07s) [17:56:08] Krinkle: ^ [17:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:08] T235360: Page view fatal from LanguageConverter: "Call to a member function isSafeToLoad() on null" - https://phabricator.wikimedia.org/T235360 [17:56:15] thx [17:57:15] 10Operations: migrate racktables to a buster VM (was: decom racktables?) - https://phabricator.wikimedia.org/T247646 (10Dzahn) [17:57:41] I make the patch for Q80M but wait until another wave of creations pass [17:58:33] 10Operations: migrate racktables to a buster VM (was: decom racktables?) 
- https://phabricator.wikimedia.org/T247646 (10Dzahn) >> Faidon wrote: > Hope this all makes sense :) >> Moritz wrote: > Buster will be unproblematic Yes, it does. Thanks for detailed responses. I'll try to move it to another buster VM... [17:58:37] (03PS1) 10Ladsgroup: Set up read new term store up to Q80M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580397 (https://phabricator.wikimedia.org/T219123) [17:58:50] 10Operations: migrate racktables to a buster VM (was: decom racktables?) - https://phabricator.wikimedia.org/T247646 (10Dzahn) a:03Dzahn [17:58:55] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [17:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:32] RECOVERY - mediawiki-installation DSH group on mw1280 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:59:56] Warming up the cache would take an hour anyway [18:00:03] jouncebot: now [18:00:03] For the next 0 hour(s) and 59 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1800) [18:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1800) [18:00:05] cscott, cscott, cscott, and cscott: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:27] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[18:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:31] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@b6bff94]: Update mobileapps to 3c73ca3 (duration: 06m 06s) [18:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:05] 10Operations: migrate racktables to a buster VM (was: decom racktables?) - https://phabricator.wikimedia.org/T247646 (10Dzahn) [18:02:07] 10Operations, 10serviceops: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [18:02:49] 10Operations, 10serviceops: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [18:04:23] (03CR) 10Elukey: [C: 03+1] "Re-ran puppet compiler: https://puppet-compiler.wmflabs.org/compiler1002/21469/" [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [18:05:15] (03CR) 10Elukey: [C: 03+1] "Gehel / Herron - should we coordinate on deployment of this? It should be a no-op but probably best to disable puppet on the target hosts " [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [18:09:49] 10Operations, 10vm-requests: Site: eqiad/codfw 2 VM request for miscweb - https://phabricator.wikimedia.org/T247887 (10Dzahn) [18:10:17] 10Operations, 10vm-requests: Site: eqiad/codfw 2 VM request for miscweb - https://phabricator.wikimedia.org/T247887 (10Dzahn) p:05Triage→03Low [18:11:37] another wave is coming and still no deadlock [18:11:57] yay [18:12:00] thanks, Amir1 [18:12:25] we can now drop wb_terms \o/ probably next week [18:12:27] 10Operations, 10Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) @Jdforrester-WMF I noticed the "metal" in the ticket title after adding VMs. 
But i guess i would argue that metal or VM does not matter for the need to upgrade and security. [18:12:59] remember: no rush [18:13:10] new problems will appear [18:13:14] and that is ok [18:16:30] i am removing a few more (oooold) eqiad appservers from production [18:16:39] but not too many at a time to be careful.. like 6 [18:16:59] to further reduce power usage [18:20:16] jynus: I agree but I highly doubt anything would arise, 95% of the reads are already on the new term store [18:20:32] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Dwisehaupt) Updates to iptables rulesets completed. frnetmon1001 added to configs where bismuth is present. [18:20:47] don't say that- now you have jinxed it and we will have issues! [18:20:49] :-D [18:20:54] haha [18:21:03] knock on the woord [18:21:05] *wood [18:28:33] jouncebot: now [18:28:33] For the next 0 hour(s) and 31 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1800) [18:28:43] is a deploy ongoing or over? [18:29:15] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[18:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:59] looks like deploy did not start because patches in swat are unmerged [18:32:31] ok with me, i can go ahead then [18:33:33] (03PS2) 10Dzahn: site/conftool: remove mw1238 through mw1243 [puppet] - 10https://gerrit.wikimedia.org/r/580384 (https://phabricator.wikimedia.org/T247780) [18:35:12] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw124[0-3].eqiad.wmnet [18:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:25] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw123[8-9].eqiad.wmnet [18:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:54] (03CR) 10Dzahn: [C: 03+2] site/conftool: remove mw1238 through mw1243 [puppet] - 10https://gerrit.wikimedia.org/r/580384 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [18:37:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:38:25] 10Operations, 10Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (10Jdforrester-WMF) >>! In T247045#5976882, @Dzahn wrote: > @Jdforrester-WMF I noticed the "metal" in the ticket title after adding VMs. But i guess i would argue that metal or VM does not matt... [18:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:29] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1238-1239].eqiad.wmnet` - mw1238.eqiad.wmnet (**PASS**) - Downtimed host on Icinga... 
[18:38:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:32] !log removing mw1238 through mw1243 - decom with cookbook (T247780 T245099) [18:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:37] T247780: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 [18:41:44] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [18:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:51] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1240-1243].eqiad.wmnet` - mw1240.eqiad.wmnet (**FAIL**) - Host steps raised exception... [18:42:42] volans: decom cookbook worked twice just fine. despite what it says :) [18:43:18] first one with 2 hosts was actually exit code 0 and the second was only because of me failing to paste mgmt password [18:43:19] jouncebot: next [18:43:19] In 0 hour(s) and 16 minute(s): Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1900) [18:43:34] mutante: but it seems to have skipped a host [18:43:41] so "empty mgmt password" made it look bad when it actually worked fine [18:43:46] after i pasted it right [18:43:50] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10JJMC89) [18:43:53] https://phabricator.wikimedia.org/T247780#5977088 [18:44:37] volans: ah, yea, it skipped. 
i see [18:45:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:13] pebcak [18:45:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:42] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1240.eqiad.wmnet` - mw1240.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found... [18:45:50] all is well with the cookbook, yep [18:50:03] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [18:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:40] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:53:08] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.24/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: [[gerrit:580390|Do not lock rows when there's no term returned (T247553 T246898)]], To catch the train (duration: 01m 08s) [18:53:18] this is noop ^ [18:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:42] The cache is warm now, moving forward [18:54:45] jouncebot: now [18:54:45] For the next 0 hour(s) and 5 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1800) [18:55:07] (03CR) 10Ladsgroup: [C: 03+2] Set up read new term store up to Q80M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580397 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [18:55:34] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:56:04] (03Merged) 10jenkins-bot: Set up read new term store up to Q80M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580397 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [18:56:06] (03PS3) 10DannyS712: Consolidate user rights assignments, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562396 (https://phabricator.wikimedia.org/T239771) [18:56:18] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:57:32] (03CR) 10Herron: [C: 03+1] "> Gehel / Herron - should we coordinate on deployment of this? 
It" [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [19:00:04] hashar and twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T1900). [19:00:28] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q80M (T219123)]] (duration: 01m 07s) [19:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:34] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [19:01:12] (03PS1) 10DannyS712: Consolidate user rights assignments, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580414 (https://phabricator.wikimedia.org/T239771) [19:01:25] (03Abandoned) 10Herron: elasticsearch: add max_clause_count setting [puppet] - 10https://gerrit.wikimedia.org/r/576967 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [19:01:40] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q80M (T219123)]], take II (duration: 01m 06s) [19:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:57] (03PS2) 10DannyS712: Consolidate user rights assignments, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580414 (https://phabricator.wikimedia.org/T239771) [19:05:25] (03PS3) 10Andrew Bogott: Keystone: convert policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) [19:05:27] (03PS3) 10Andrew Bogott: glance: move python.json files to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) [19:05:29] (03PS3) 10Andrew Bogott: designate: move policy.json to yaml [puppet] - 
10https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) [19:05:55] (03PS2) 10Dzahn: remove production IPs of mw1221 through mw1226 [dns] - 10https://gerrit.wikimedia.org/r/580107 (https://phabricator.wikimedia.org/T247780) [19:06:58] (03PS3) 10Dzahn: remove production IPs of mw1221 through mw1226 [dns] - 10https://gerrit.wikimedia.org/r/580107 (https://phabricator.wikimedia.org/T247780) [19:10:46] (03CR) 10RLazarus: [C: 03+1] remove production IPs of mw1221 through mw1226 [dns] - 10https://gerrit.wikimedia.org/r/580107 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [19:11:55] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10wiki_willy) Sure, that works for me @Marostegui . Feel free to shoot open a dc-ops task and assign to @Jclark-ctr . Thanks, Willy [19:12:17] (03PS1) 10Ladsgroup: Read from the new term store everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580416 (https://phabricator.wikimedia.org/T219123) [19:15:49] (03PS1) 10Dzahn: DHCP: remove mw1238 through mw1243 [puppet] - 10https://gerrit.wikimedia.org/r/580417 (https://phabricator.wikimedia.org/T247780) [19:19:26] (03PS6) 10DannyS712: trwiki: Grant interface editors editprotected & editsemiprotected [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672) [19:20:08] (03CR) 10CDanis: [C: 03+1] "Thresholds look reasonable, judging by the last 4 hours or so" [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [19:21:38] (03CR) 10Dzahn: [C: 03+2] remove production IPs of mw1221 through mw1226 [dns] - 10https://gerrit.wikimedia.org/r/580107 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [19:22:18] (03CR) 10Dzahn: [C: 03+2] DHCP: remove mw1238 through mw1243 [puppet] - 
10https://gerrit.wikimedia.org/r/580417 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [19:22:27] (03PS2) 10Dzahn: DHCP: remove mw1238 through mw1243 [puppet] - 10https://gerrit.wikimedia.org/r/580417 (https://phabricator.wikimedia.org/T247780) [19:27:18] (03PS1) 10Dzahn: remove production IPs of mw1238 through mw1243 [dns] - 10https://gerrit.wikimedia.org/r/580418 (https://phabricator.wikimedia.org/T247780) [19:38:34] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8db09ed]: Various PCS endpoints additions and fixes T247295 T247096 T244175 [19:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:43] T244175: [Bug] mobile-html-offline-resources endpoint is 404ing - https://phabricator.wikimedia.org/T244175 [19:38:43] T247295: Add content type check to mobile-html and mobile-html-offline-resources to enforce the latest version - https://phabricator.wikimedia.org/T247295 [19:38:43] T247096: Expose new PCS i18n endpoint - https://phabricator.wikimedia.org/T247096 [19:39:38] (03PS2) 10Dzahn: iegreview: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/579677 [19:48:29] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:50:01] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Jdforrester-WMF) [19:50:06] 10Operations, 10MediaWiki-Debug-Logger, 10observability, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar): MWExceptionHandler reqId sometimes differs from php-wmerrors reqId - https://phabricator.wikimedia.org/T247786 (10AMooney) p:05Triage→03Medium a:03holger.knust [19:51:39] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:51:51] (03CR) 10Jhedden: "I think this will also require updating glance-api.conf" [puppet] - 10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [19:51:51] !log miscweb1001 - testing if ferm 80 firewall hole is needed for envoy, temp. 
disabled puppet, restarted ferm [19:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:06] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8db09ed]: Various PCS endpoints additions and fixes T247295 T247096 T244175 (duration: 14m 31s) [19:53:07] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:12] T244175: [Bug] mobile-html-offline-resources endpoint is 404ing - https://phabricator.wikimedia.org/T244175 [19:53:13] T247295: Add content type check to mobile-html and mobile-html-offline-resources to enforce the latest version - https://phabricator.wikimedia.org/T247295 [19:53:13] T247096: Expose new PCS i18n endpoint - https://phabricator.wikimedia.org/T247096 [19:54:15] !log miscweb1001 - restarted ferm, reverted live hack [19:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:27] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:54:42] ^^ will get back to normal in a minute [19:54:48] thanks Pchelolo [19:55:04] just an artifact of a rolling deploy [19:55:56] ok [19:56:20] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline 
resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:56:21] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:57:17] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:57:45] (03PS3) 10Dzahn: iegreview: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/579677 [19:58:03] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:58:25] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:58:29] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:58:39] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:58:48] (03CR) 10Dzahn: [C: 03+2] "this won't do anything because port 80 is already open to caches from the racktables rule on the same host" [puppet] - 10https://gerrit.wikimedia.org/r/579677 (owner: 10Dzahn) [20:03:16] (03PS4) 10Andrew Bogott: Keystone: convert policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) [20:03:18] (03PS4) 10Andrew Bogott: glance: move python.json files to yaml [puppet] - 
10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) [20:03:20] (03PS4) 10Andrew Bogott: designate: move policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) [20:04:19] (03PS4) 10Dzahn: iegreview: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/579677 [20:06:17] (03CR) 10Jhedden: [C: 03+1] designate: move policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [20:07:17] (03CR) 10Jhedden: "Commit message should say policy.json, other than that looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [20:08:39] (03CR) 10Jhedden: [C: 03+1] Keystone: convert policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [20:12:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:13:33] Telia again [20:13:37] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:14:02] looks like the SWAT deploy didn't happen? [20:14:25] cscott: yea, i noticed that too [20:14:25] (03CR) 10Jhedden: [C: 03+1] "Great idea, looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron) [20:15:01] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) 05Resolved→03Open This went down again today at 20:12 UTC per Icinga [20:16:07] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) 20:12 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: htt... [20:18:44] 10Operations, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10AMooney) Untagging because there is nothing for CPT to do at this point. Anomie will stay subscribed and will retag if needed. [20:22:04] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:27:02] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) mailed Telia about it [20:27:44] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:31:39] !log boron - had degraded systemd state in Icinga - systemctl start docker-reporter-base-images [20:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:31] !log boron - systemctl start docker-reporter-k8s-images ; systemctl start docker-reporter-releng-images [20:33:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:57] RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:28] (03PS3) 10Jforrester: Re-enable DiscussionTools for everyone on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580383 (https://phabricator.wikimedia.org/T247802) (owner: 10Bartosz Dziewoński) [20:36:47] (Beta-only config change going out.) [20:37:02] (03CR) 10Jforrester: [C: 03+2] Re-enable DiscussionTools for everyone on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580383 (https://phabricator.wikimedia.org/T247802) (owner: 10Bartosz Dziewoński) [20:37:55] (03Merged) 10jenkins-bot: Re-enable DiscussionTools for everyone on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580383 (https://phabricator.wikimedia.org/T247802) (owner: 10Bartosz Dziewoński) [20:38:41] MatmaRex: Done. [20:38:54] oh, thanks! 
[20:39:13] (03PS5) 10Andrew Bogott: glance: move policy.json files to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) [20:39:16] (03PS5) 10Andrew Bogott: designate: move policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) [20:40:12] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (10AMooney) [20:41:01] @bang 72 [20:50:19] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/WikimediaEditorTasks: Fix revert counting for non-language-specific counters, take 2 (T244974) (duration: 01m 12s) [20:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:35] T244974: Connect image tag contributions to Suggested edits profile stats - https://phabricator.wikimedia.org/T244974 [21:02:21] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) Telia: We have opened case 01134611 and are currently investigating. 
[21:03:06] (03PS1) 10Mholloway: WikimediaEditorTasks: Enable depicts counting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580470 (https://phabricator.wikimedia.org/T247874) [21:05:49] (03CR) 10Jhedden: [C: 03+1] glance: move policy.json files to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [21:06:55] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Enable depicts counting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580470 (https://phabricator.wikimedia.org/T247874) (owner: 10Mholloway) [21:07:53] (03Merged) 10jenkins-bot: WikimediaEditorTasks: Enable depicts counting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580470 (https://phabricator.wikimedia.org/T247874) (owner: 10Mholloway) [21:08:23] * Krinkle switches from testing on mwdebug1002 to testing on mwdebug1001 [21:09:21] sorry for any confusion mdholloway [21:10:29] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:10:40] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Enable Depicts counting (T247874) (duration: 01m 07s) [21:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:46] T247874: SE: Activate Depicts counting in production - https://phabricator.wikimedia.org/T247874 [21:11:19] Krinkle: sorry, didn't realize you were doing testing! [21:12:20] np, I didn't say so in IRC [21:13:47] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache full. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:15:28] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Enable Depicts counting (again) (T247874) (duration: 01m 07s) [21:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:51] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:16:09] (done) [21:16:47] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:17:11] so for train blocker hmm [21:17:20] I am trying to find some time range the issue started on beta [21:17:22] brennen: ^ :) [21:19:52] it has mediawiki debug logs since March 2nd, specially for the MessageCache log bucket [21:19:57] so I am digging into those [21:20:30] (03PS1) 10Dzahn: misc_apps: add firewall rule to let envoy talk to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/580479 [21:21:26] (03CR) 10jerkins-bot: [V: 04-1] misc_apps: add firewall rule to let envoy talk to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/580479 (owner: 10Dzahn) [21:29:40] (03PS2) 10Dzahn: misc_apps: add firewall rule to let envoy talk to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/580479 [21:30:48] (03CR) 10jerkins-bot: [V: 04-1] misc_apps: add firewall rule to let envoy talk to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/580479 (owner: 10Dzahn) [21:33:38] * Krinkle testing on mw2170 (codfw, appserver) [21:34:33] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:35:39] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces 
up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:37:04] my finding is that on beta the cache issue started around 2020-03-05 9:00 UTC give or take 15 minutes or so [21:37:59] @steward https://incubator.wikimedia.org/wiki/Special:Contributions/31.217.20.211 is the same lta [21:38:03] Sorry, wrong channel [21:38:13] lol [21:38:37] 10Operations: Is invite-wmfall@wikimedia.org a Mailman list? - https://phabricator.wikimedia.org/T247848 (10Dzahn) No, it's not a mailman list if "lists" is not part of the address. [21:42:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:42:54] well, ok. i was about to call that resolved but looks like Telia is working on a splice or something [21:43:07] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:44:29] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10colewhite) Good idea forking the original task. Thanks for that! > I 'd say yes. We can draft a plan to make sure we... [21:45:57] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10colewhite) p:05Triage→03Medium [21:45:59] 10Operations: Is invite-wmfall@wikimedia.org a Mailman list? 
- https://phabricator.wikimedia.org/T247848 (10Dzahn) invite-wmfall@wikimedia.org is undeliverable: Address invite-wmfall@wikimedia.org does not exist [21:49:43] PROBLEM - PHP opcache health on mw2170 is CRITICAL: CRITICAL: opcache full. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:50:32] (03CR) 10Cwhite: prometheus: add icinga average latency checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [21:54:48] !log krinkle@mw2170$ disable-puppet (Testing for T99740) [21:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:54] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [21:55:29] anyone mind if I switch mwdebug1001 to wmf.24? [21:55:38] (if I find out how to do that) [22:03:40] hashar: Krinkle is testing things in prod right now. [22:04:02] hashar: You'd need to manually edit /srv/mediawiki/wikiversions.php on the server directly. [22:04:10] But maybe wait a sec. [22:04:16] yeah what I thought [22:04:28] but then I found out that I can just alternate between different wikis [22:04:29] RECOVERY - PHP opcache health on mw2170 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:04:33] anyway MessageCache::load: Loading en... local cache has the wrong hash, got from global cache [22:04:39] that shows up with mwdebug1001 [22:04:47] `sudo -u wikidev nano /srv/mediawiki/wikiversions.php` would do it. [22:05:14] As long as it's not puppet-related, I don't mind. [22:05:24] I'm on a codfw server now [22:06:48] on mwdebug1001 I trigger the "local cache has the wrong hash, got from global cache" as needed [22:09:51] PROBLEM - PHP7 rendering on mw2170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1309 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:10:35] I just have to browse to another wiki apparently [22:10:47] so there must be something in the cache that varies somehow :/ [22:11:41] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@0adead4]: Update mobileapps to ec6fd6e [22:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:59] RECOVERY - PHP7 rendering on mw2170 is OK: HTTP OK: HTTP/1.1 200 OK - 75293 bytes in 0.348 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:12:59] PROBLEM - PHP opcache health on mw2170 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:17:49] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@0adead4]: Update mobileapps to ec6fd6e (duration: 06m 08s) [22:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:19] RECOVERY - PHP opcache health on mw2170 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:19:46] Krinkle: brennen: I am too tired to follow up. Hopefully that "local cache has the wrong hash," might lead to somewhere [22:20:07] thanks for investigating hashar. [22:20:29] a couple of hours of investigation and I don't even have a firm root cause :-\ [22:20:32] it is definitely a good time to not be working in france. 
[22:20:47] but at least it is easy to trigger the issue on mwdebug1001 [22:21:24] just switching between two wikis seems sufficient [22:21:58] it might also be happening right now in production, only to exacerbate when enwiki is being switched [22:22:04] that is all I got for tonight [22:23:54] (03PS3) 10RhinosF1: Add wikimedia cloud mailing list to mailman’s robots.txt [puppet] - 10https://gerrit.wikimedia.org/r/563684 (https://phabricator.wikimedia.org/T242520) [22:29:14] (03CR) 10Dzahn: [C: 03+2] remove production IPs of mw1238 through mw1243 [dns] - 10https://gerrit.wikimedia.org/r/580418 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [22:49:25] 10Operations, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Deployment services): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10thcipriani) This one: ` Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-updat... [22:49:25] !log warming up cache for Q80M to Q88M for new term store on db1111, db1126, db1104, db1092 (T219123) [22:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:31] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [22:59:25] (03CR) 10Dwisehaupt: [C: 03+1] "Looks ok to me. Just may want to update the comment line as noted if the checks follow the master." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580351 (owner: 10Jgreen) [23:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200317T2300). [23:00:04] wikimedia/DannyS712 and wikimedia/DannyS712: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:32] wat [23:00:44] (03PS11) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) [23:00:46] wrong ircnick template it seems [23:00:47] Thats me, sorry, I didn't know how to properly fill in the template [23:01:27] (03PS3) 10DannyS712: Consolidate user rights assignments, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580414 (https://phabricator.wikimedia.org/T239771) [23:01:31] DannyS712: see https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1860478&oldid=1860477 for my fix [23:02:57] 2 patches are at https://gerrit.wikimedia.org/r/#/q/hashtag:%22swat-2020-03-17-evening%22 - please make sure part 1 is merged first [23:04:58] (03CR) 10CRusnov: [C: 03+2] tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [23:05:55] (03PS2) 10CRusnov: Update Netbox to v2.7.10-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/579593 [23:13:47] RoanKattouw Niharika Urbanecm is the swat deploy going to happen? [23:17:19] DannyS712: since James_F commented at the task, I would prefer if he could review/handle that. [23:18:47] The comments were from 3 months ago (I forgot about the task) so I'm not sure if he remembers it [23:19:53] either way, I'm uncomfortable with pushing this out myself. 
[23:20:05] okay, understood [23:22:54] (03CR) 10Dzahn: "please keep in mind that icinga host groups have to exist / be created before this is merged or Icinga config will break not finding them " [puppet] - 10https://gerrit.wikimedia.org/r/580351 (owner: 10Jgreen) [23:26:22] (03CR) 10Dzahn: nsca_frack.cfg.erb - merge some groups, add fran1001, clean up format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580351 (owner: 10Jgreen) [23:32:12] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) "Kindly be informed that your circuit involved in major outage in our network , suspecting faulty card , investigation is ongoing. Major case 01134704." [23:32:52] 10Operations: Is invite-wmfall@wikimedia.org a Mailman list? - https://phabricator.wikimedia.org/T247848 (10Dzahn) I think what it is is a Google people group and not really an email address. [23:37:35] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/579593 (owner: 10CRusnov) [23:41:26] (03PS1) 10CRusnov: deploy-check.py: Fix handling of subdirs in zones directory [dns] - 10https://gerrit.wikimedia.org/r/580520 [23:42:23] (03CR) 10CRusnov: "A critical fix for the untested path with --deploy" [dns] - 10https://gerrit.wikimedia.org/r/580520 (owner: 10CRusnov) [23:44:03] (03PS2) 10CRusnov: deploy-check.py: Fix handling of subdirs in zones directory [dns] - 10https://gerrit.wikimedia.org/r/580520 [23:45:43] (03CR) 10CRusnov: [C: 03+2] "Self merging because authdns merges are currently broken." 
[dns] - 10https://gerrit.wikimedia.org/r/580520 (owner: 10CRusnov) [23:48:21] (03PS2) 10CRusnov: authdns-local-update: Plumb in netbox snippet dir [puppet] - 10https://gerrit.wikimedia.org/r/580371 [23:48:31] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Dwisehaupt) Migrated nessus code, config and reports from bismuth to frnetmon1001 using the instructions here: https://community.tenable.com/s/article/Migrating-Nessus-to-new-S... [23:50:07] (03PS3) 10Dzahn: misc_apps: add firewall rule to let envoy talk to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/580479 [23:51:43] (03CR) 10CRusnov: [C: 03+2] authdns-local-update: Plumb in netbox snippet dir [puppet] - 10https://gerrit.wikimedia.org/r/580371 (owner: 10CRusnov) [23:56:47] (03PS2) 10Volans: gen-zones: transliterate commit message to ASCII [dns] - 10https://gerrit.wikimedia.org/r/579586 [23:58:47] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets