[00:51:51] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1974301040 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:05] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4116057112 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:39] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 894609800 and 181 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:13] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 170 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:45] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1016 and 202 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:21] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 202288 and 238 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:07] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 9049073864 and 595 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:45] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5694610992 and 332 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:51] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3724200072 and 189 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:21] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5877000112 and 318 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:45] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1088 and 232 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:51] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 105424 and 238 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:12:01] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6152 and 368 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:12:29] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 395 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:57] SRE, vm-requests: What is the Scientific Method - https://phabricator.wikimedia.org/T271634 (Davishca)
[02:52:58] SRE, vm-requests: What is the Scientific Method - https://phabricator.wikimedia.org/T271634 (Davishca) What is the Scientific Method? The scientific method is a method used to discover new understandings about the natural world based on making falsifiable predictions (hypotheses), testing them empiricall...
[03:04:58] SRE, LDAP: Create auto-populated LDAP group of those who have production shell access - https://phabricator.wikimedia.org/T271587 (Peachey88)
[07:30:15] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 359 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:31:55] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 13 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:56:49] SRE, vm-requests: What is the Scientific Method - https://phabricator.wikimedia.org/T271634 (DannyS712) @Aklapper or another phab admin, can you please close this? Or reset the edit policy at least? Thanks
[09:41:56] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1004 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:43:37] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1004 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:53:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:55:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:57:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:00:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:33:26] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Aklapper)
[13:43:37] PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:19] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:10:19] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:53] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1042 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:19:09] (PS1) Andrew Bogott: OpenStack haproxy: change http service health check interval to 3s [puppet] - https://gerrit.wikimedia.org/r/655275 (https://phabricator.wikimedia.org/T271647)
[16:22:32] (CR) Andrew Bogott: [C: +2] OpenStack haproxy: change http service health check interval to 3s [puppet] - https://gerrit.wikimedia.org/r/655275 (https://phabricator.wikimedia.org/T271647) (owner: Andrew Bogott)
[16:57:40] (PS1) Andrew Bogott: OpenStack rabbitmq: set busy wait threshold to 'none' [puppet] - https://gerrit.wikimedia.org/r/655277 (https://phabricator.wikimedia.org/T271647)
[16:59:06] (CR) jerkins-bot: [V: -1] OpenStack rabbitmq: set busy wait threshold to 'none' [puppet] - https://gerrit.wikimedia.org/r/655277 (https://phabricator.wikimedia.org/T271647) (owner: Andrew Bogott)
[17:01:33] (PS2) Andrew Bogott: OpenStack rabbitmq: set busy wait threshold to 'none' [puppet] - https://gerrit.wikimedia.org/r/655277 (https://phabricator.wikimedia.org/T271647)
[17:09:04] (CR) Andrew Bogott: [C: +2] OpenStack rabbitmq: set busy wait threshold to 'none' [puppet] - https://gerrit.wikimedia.org/r/655277 (https://phabricator.wikimedia.org/T271647) (owner: Andrew Bogott)
[17:14:00] (PS1) Andrew Bogott: When changing rabbitmq-env.conf, notify rabbit service [puppet] - https://gerrit.wikimedia.org/r/655278 (https://phabricator.wikimedia.org/T271647)
[17:15:26] (CR) jerkins-bot: [V: -1] When changing rabbitmq-env.conf, notify rabbit service [puppet] - https://gerrit.wikimedia.org/r/655278 (https://phabricator.wikimedia.org/T271647) (owner: Andrew Bogott)
[17:15:58] (PS2) Andrew Bogott: When changing rabbitmq-env.conf, notify rabbit service [puppet] - https://gerrit.wikimedia.org/r/655278 (https://phabricator.wikimedia.org/T271647)
[17:17:48] (CR) Andrew Bogott: [C: +2] When changing rabbitmq-env.conf, notify rabbit service [puppet] - https://gerrit.wikimedia.org/r/655278 (https://phabricator.wikimedia.org/T271647) (owner: Andrew Bogott)
[17:51:49] (PS1) Majavah: Revert "Switch fiwiki to their 500k temporary logo!" [mediawiki-config] - https://gerrit.wikimedia.org/r/655281
[17:51:51] (PS1) Majavah: Revert "Add fiwiki 500k temporary logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/655282
[18:49:41] (PS5) Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712)
[18:52:35] (PS3) Jforrester: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712)
[18:52:41] (PS3) Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712)
[20:18:57] PROBLEM - SSH on logstash1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:19:09] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f8e40720518: Failed to establish a new connection: [Errno 111] Connection
[20:19:09] ://wikitech.wikimedia.org/wiki/Search%23Administration
[20:20:23] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:27] RECOVERY - SSH on logstash1008 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:27] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:55] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, active_shards: 916, delayed_unassigned_shards: 0, initializing_shards: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, number_of_nodes: 6, unassigned_shards: 0, timed_out: False, active_shards_percent_as_number: 100.0, number_of_in_flight_fetch: 0, number_of_data_nod
[20:45:55] aiting_in_queue_millis: 0, number_of_pending_tasks: 0, active_primary_shards: 483 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:45:46] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Reedy) Is there a task (or should the be it?) for actually swapping from PHP 7.2 to... newer PHP (7.3 or whatever)?...
[22:19:50] (PS1) Urbanecm: Enable anniversary logo for cs.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/655292 (https://phabricator.wikimedia.org/T271662)
[22:22:38] (PS1) Urbanecm: Set import sources for mrwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/655293 (https://phabricator.wikimedia.org/T270402)