[00:01:31] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:32] (PS1) Dzahn: site: apply doc role on doc2001 and unify role stanza [puppet] - https://gerrit.wikimedia.org/r/650623 (https://phabricator.wikimedia.org/T247653)
[00:05:50] (PS2) Dzahn: site: apply doc role on doc2001 and unify role stanza [puppet] - https://gerrit.wikimedia.org/r/650623 (https://phabricator.wikimedia.org/T247653)
[00:06:09] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:55] (CR) Dzahn: [C: +2] site: apply doc role on doc2001 and unify role stanza [puppet] - https://gerrit.wikimedia.org/r/650623 (https://phabricator.wikimedia.org/T247653) (owner: Dzahn)
[00:07:27] (CR) Dzahn: "This is even already on the TLS cert used by envoy." [puppet] - https://gerrit.wikimedia.org/r/650623 (https://phabricator.wikimedia.org/T247653) (owner: Dzahn)
[00:15:36] (PS1) Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653)
[00:15:48] (Merged) jenkins-bot: pipeline: Fix malformed pipeline config [core] (wmf/1.36.0-wmf.22) - https://gerrit.wikimedia.org/r/650567 (owner: Dduvall)
[00:16:05] (PS1) Dduvall: pipeline: Increase fetch depth to 50 [core] (wmf/1.36.0-wmf.22) - https://gerrit.wikimedia.org/r/650573
[00:16:14] (CR) Dduvall: "check experimental" [core] (wmf/1.36.0-wmf.22) - https://gerrit.wikimedia.org/r/650573 (owner: Dduvall)
[00:16:52] (PS2) Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653)
[00:18:18] (PS2) Dzahn: scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - https://gerrit.wikimedia.org/r/650306
[00:18:31] (CR) jerkins-bot: [V: -1] scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - https://gerrit.wikimedia.org/r/650306 (owner: Dzahn)
[00:20:38] (PS3) Dzahn: scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - https://gerrit.wikimedia.org/r/650306
[00:26:23] (PS1) Dzahn: add discovery-geo name and resources for doc [dns] - https://gerrit.wikimedia.org/r/650626
[00:28:56] (PS1) Dzahn: add doc to misc services with multiple backends [dns] - https://gerrit.wikimedia.org/r/650628 (https://phabricator.wikimedia.org/T247653)
[00:31:06] (CR) Dduvall: "check experimental" [core] (wmf/1.36.0-wmf.22) - https://gerrit.wikimedia.org/r/650567 (owner: Dduvall)
[00:33:19] (PS1) Cwhite: profile: add priority to logstash filter filenames [puppet] - https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533)
[00:34:18] (PS2) Cwhite: profile: add priority to logstash filter filenames [puppet] - https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533)
[00:35:39] (PS1) Dzahn: move misc services with single backend to own section [dns] - https://gerrit.wikimedia.org/r/650630
[00:35:59] (PS3) Cwhite: profile: add priority to logstash filter filenames [puppet] - https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533)
[00:38:25] (Abandoned) Cwhite: profile: add verify-filters script [puppet] - https://gerrit.wikimedia.org/r/602727 (https://phabricator.wikimedia.org/T254533) (owner: Cwhite)
[00:38:56] (CR) Dduvall: "check experimental" [core] (wmf/1.36.0-wmf.22) - https://gerrit.wikimedia.org/r/650573 (owner: Dduvall)
[00:39:03] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 161022864 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:39:03] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2936016848 and 201 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:23] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 62320 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:23] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5733957648 and 455 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:03] (PS1) Dzahn: puppet_compiler: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650631
[00:41:17] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 895114544 and 203 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:17] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 554532864 and 186 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:31] (CR) jerkins-bot: [V: -1] puppet_compiler: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650631 (owner: Dzahn)
[00:42:41] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1224 and 242 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:41] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 50840 and 243 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:05] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 21200 and 266 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:13] (PS1) Dzahn: prometheus:node_exporter: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650632
[00:43:43] (CR) Cwhite: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27219/" [puppet] - https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) (owner: Cwhite)
[00:44:33] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 47728 and 353 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:23] (PS12) Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565)
[00:56:41] (PS1) Mstyles: update flink config with swift and other values [deployment-charts] - https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876)
[00:57:03] (PS1) Dzahn: pybaltest: convert to role/profile, hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650634
[00:58:37] (PS1) Dzahn: swap: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650635
[01:00:06] (CR) jerkins-bot: [V: -1] swap: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650635 (owner: Dzahn)
[01:04:31] (PS1) Dzahn: graphoid: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650636
[01:06:52] (PS1) Dzahn: mariadb::maintenance: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650637
[02:37:44] (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/647032 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[02:37:58] (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[03:56:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:58:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:35:08] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (doc2001), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:35:40] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:28:30] (CR) Elukey: [C: +2] druid: Migrate hiera() to lookup() and setting datatype in middlemanager [puppet] - https://gerrit.wikimedia.org/r/650617 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:49:48] PROBLEM - WDQS HTTP on wdqs1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:49:56] PROBLEM - WDQS SPARQL on wdqs1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201219T0800)
[08:24:26] PROBLEM - SSH on wdqs1011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:25:56] RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:41:08] PROBLEM - SSH on wdqs1011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:42:40] RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:44] PROBLEM - SSH on wdqs1011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:13:00] RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:17:32] PROBLEM - SSH on wdqs1011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:19:04] RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:23:36] PROBLEM - SSH on wdqs1011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:42:10] Operations, Diff-blog, Traffic, HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (Aklapper)
[10:02:02] RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:11:28] PROBLEM - SSH on wdqs1011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:31:40] RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:36:12] PROBLEM - SSH on wdqs1011 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:15:48] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:17:20] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:25:58] Operations, Growth-Team, Mail, Notifications, and 2 others: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? - https://phabricator.wikimedia.org/T202329 (Aklapper) a:nettrom_WMF→None Removing task assignee due to inactivity, as this open tas...
[11:29:30] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, observability, and 3 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (Aklapper) a:dduvall→None Removing task assignee due to inactivity, as this open task has...
[11:30:26] Operations, Mail: Implement MTA-STS - https://phabricator.wikimedia.org/T203883 (Aklapper) a:herron→None Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails sent to the task assignee on Oct27 and Nov23). Please a...
[11:30:42] Operations, SRE-tools, User-Joe, User-jijiki: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (Aklapper) a:jijiki→None Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two year...
[11:31:12] Operations, Mail, Patch-Needs-Improvement, User-herron: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (Aklapper) a:herron→None Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the email...
[11:31:54] Operations, Mail, Wikimedia-Mailing-lists: investigate caching of mailman listinfo pages - https://phabricator.wikimedia.org/T197819 (Aklapper) a:herron→None Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the email...
[11:32:24] Operations, Traffic: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521 (Aklapper) a:Vgutierrez→None Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the ema...
[11:34:01] Operations, Commons: Improve mwmaint servers (e.g. mwmain1001) userland to process server side uploads - https://phabricator.wikimedia.org/T159661 (Aklapper) a:Dereckson→None Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (...
[11:35:16] Puppet, Toolforge, Documentation, User-srodlund: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733 (Aklapper) a:srodlund→None Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails se...
[12:31:09] Daimona: hi
[12:31:30] Hey
[12:31:41] I'll talk in -tech better
[12:31:45] if that's okay
[13:58:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_wikifeeds_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:00:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:17:41] Operations, Gerrit: Gerrit is OOMing - https://phabricator.wikimedia.org/T270451 (hashar) Resolved→Open This issue is still ongoing, we had Gerrit going OOM with the exact same symptom this week. The next steps are: * switch it from Java 8 to Java 11 * upgrade to Gerrit 3.3
[15:19:59] Operations, Gerrit: Gerrit is OOMing - https://phabricator.wikimedia.org/T270451 (hashar)
[15:20:08] Operations, Gerrit: Gerrit is OOMing - https://phabricator.wikimedia.org/T270451 (hashar) Sorry I have mistaken this task with T263008 :D
[15:28:40] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[15:30:08] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[16:38:27] (PS1) Andrew Bogott: Keystone: disable logging to /var/log/keystone/ [puppet] - https://gerrit.wikimedia.org/r/650776 (https://phabricator.wikimedia.org/T269419)
[16:57:22] (PS2) Andrew Bogott: Keystone: disable logging to /var/log/keystone/ [puppet] - https://gerrit.wikimedia.org/r/650776 (https://phabricator.wikimedia.org/T270554)
[16:58:33] (CR) Andrew Bogott: [C: +2] Keystone: disable logging to /var/log/keystone/ [puppet] - https://gerrit.wikimedia.org/r/650776 (https://phabricator.wikimedia.org/T270554) (owner: Andrew Bogott)