[00:04:03] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01064 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:16:09] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards: 931, active_shards_percent_as_number: 100.0, number_of_nodes: 6, cluster_name: production-logstash-eqiad, initializing_shards: 0, status: green, unassigned_shards: 0, delayed_unassigned_shards: 0, active_primary_shards: 488, n [00:16:09] es: 3, timed_out: False, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:16:43] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:23] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [00:24:43] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [00:33:57] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36915464 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:37:11] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 92104528 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:37:47] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20035104 and 280 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:38:51] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 859155160 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:38:57] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 815900040 and 351 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:39:17] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 446209472 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:39:29] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 490729712 and 24 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:39:57] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1086456384 and 410 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:27] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 
(host:localhost) 87438360 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:31] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1192289464 and 68 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:41:03] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 841106296 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:41:15] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 49038224 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:41:55] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 470270448 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:41] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 269340904 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:43:43] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1371635808 and 62 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:05] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 336631616 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:37] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1012800008 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:53] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 
137680 and 92 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:05] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1624 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:09] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 154176 and 108 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:09] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 154176 and 108 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:15] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1440200 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:33] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 114944 and 132 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:45] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 145 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:55] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 39896 and 155 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:43] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1776280 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:49] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4832 and 207 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:57] PROBLEM - Postgres Replication Lag on maps2001 
is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 644676976 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:27] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 66408 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:37] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 13464 and 100 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:05] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 15248 and 129 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:23] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 37320 and 147 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:23] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24040 and 147 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:57:43] (CR) DannyS712: [C: +1] Assign urlshortener-create-url permission [mediawiki-config] - https://gerrit.wikimedia.org/r/645309 (https://phabricator.wikimedia.org/T229633) (owner: Ammarpad) [04:26:49] (PS1) Andrew Bogott: Glance: fix copy/paste mistake in logging class name [puppet] - https://gerrit.wikimedia.org/r/645721 [04:26:51] (PS1) Andrew Bogott: Initial cinder class and templates [puppet] - https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) [04:28:01] (CR) Andrew Bogott: [C: +2] Glance: fix copy/paste mistake in logging class name [puppet] - https://gerrit.wikimedia.org/r/645721 (owner: Andrew Bogott) [04:28:26] (CR) jerkins-bot: [V: -1] Initial cinder class and templates [puppet] - 
https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott) [04:32:34] (PS2) Andrew Bogott: Initial cinder class and templates [puppet] - https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) [04:46:27] (PS1) Andrew Bogott: Dummy db passwords for Cinder [labs/private] - https://gerrit.wikimedia.org/r/645732 [04:46:52] (PS3) Andrew Bogott: Initial cinder class and templates [puppet] - https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) [04:47:07] (CR) Andrew Bogott: [V: +2 C: +2] Dummy db passwords for Cinder [labs/private] - https://gerrit.wikimedia.org/r/645732 (owner: Andrew Bogott) [04:51:54] (PS4) Andrew Bogott: Initial cinder class and templates [puppet] - https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) [05:00:56] (PS5) Andrew Bogott: Initial cinder class and templates [puppet] - https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) [05:04:24] (PS6) Andrew Bogott: Initial cinder class and templates [puppet] - https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) [05:08:29] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01002 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [05:19:38] (PS1) Andrew Bogott: Glance: add a hiera setting for the glance ceph pool [puppet] - https://gerrit.wikimedia.org/r/645746 (https://phabricator.wikimedia.org/T263461) [05:20:39] (PS2) Andrew Bogott: Glance: add a hiera setting for the glance ceph pool [puppet] - https://gerrit.wikimedia.org/r/645746 (https://phabricator.wikimedia.org/T263461) [05:23:02] (PS3) Andrew Bogott: Glance: add a hiera setting for the glance ceph pool [puppet] - https://gerrit.wikimedia.org/r/645746 (https://phabricator.wikimedia.org/T263461) 
[05:26:46] (CR) Andrew Bogott: [C: +2] Glance: add a hiera setting for the glance ceph pool [puppet] - https://gerrit.wikimedia.org/r/645746 (https://phabricator.wikimedia.org/T263461) (owner: Andrew Bogott) [05:31:27] (PS7) Andrew Bogott: Initial cinder class and templates [puppet] - https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) [06:02:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:32:03] PROBLEM - ores on ores1002 is CRITICAL: connect to address 10.64.0.52 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:39:31] RECOVERY - ores on ores1002 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 4.674 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201206T0800) [10:44:19] (PS1) Ladsgroup: presto: Migrate hiera() to lookup() [puppet] - https://gerrit.wikimedia.org/r/645833 (https://phabricator.wikimedia.org/T209953) [11:10:14] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26981/" [puppet] - https://gerrit.wikimedia.org/r/645833 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup) [11:17:38] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26982/console" [puppet] - https://gerrit.wikimedia.org/r/645833 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup) [11:20:52] (CR) Elukey: [V: +1 C: +2] "Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/645833 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup) [12:25:59] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 109566184 and 241 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:28:23] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17870704 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:28:27] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1458632 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:29:03] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 147187280 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:29:17] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26396184 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:29:53] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 263109808 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:11] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 571166840 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:25] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 571496424 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 
(host:localhost) 652187696 and 48 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:23] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 512042712 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:57] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 69640 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:07] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:07] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:19] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 79 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:55] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 7776 and 93 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:55] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4184 and 114 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:35:55] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 33528 and 173 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:11] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 190 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:27:08] (CR) Dzahn: [C: +1] "thanks 
for the additional details in the comments. appreciated. I agree it's not worth spending much more time on this. Basically will say" [puppet] - https://gerrit.wikimedia.org/r/645120 (owner: Elukey) [19:05:37] (PS1) Brian Wolff: Add PoolCounter settings for DPL [mediawiki-config] - https://gerrit.wikimedia.org/r/645994 (https://phabricator.wikimedia.org/T263220) [19:31:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:33:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:29:29] PROBLEM - MariaDB Replica Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:23:51] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [22:10:27] RECOVERY - MariaDB Replica Lag: pc1 on pc2007 is OK: OK slave_sql_lag Replication lag: 53.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica