[00:27:08] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:58] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:50] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 553906800 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:37:50] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4764420936 and 769 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:39:20] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1205764104 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:34] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8147460112 and 934 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:34] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1094098976 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:22] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 638290272 and 46 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:32] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1083556056 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:54] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2572763496 and 161 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:48] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 145728 and 135 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:36] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 992 and 182 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:46] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 284632 and 193 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:08] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 217840 and 215 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:52] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 346504 and 258 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:58] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 56928 and 266 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:58] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 140873880 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:26] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 65637856 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:32] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 190155144 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:32] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1178733344 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:42] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 151791368 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:46] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 433 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:02] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 507147904 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:18] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 465 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:10] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 265022856 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:14] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 720 and 46 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:50] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:18] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 23544 and 110 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:22] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 26832 and 115 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:24] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1008 and 116 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:24] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 68360 and 116 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:32] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 31280 and 125 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:14] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:12] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:29] (PS1) Andrew Bogott: Cinder: allow api filtering on 'bootable' [puppet] - https://gerrit.wikimedia.org/r/648840 (https://phabricator.wikimedia.org/T269511)
[03:13:33] (CR) Andrew Bogott: [C: +2] Cinder: allow api filtering on 'bootable' [puppet] - https://gerrit.wikimedia.org/r/648840 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[03:15:54] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:17:34] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:18:00] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[03:19:52] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[03:22:32] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:24:36] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[03:24:52] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[03:25:48] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:29:34] (PS1) Andrew Bogott: Horizon: update LAUNCH_INSTANCE_DEFAULTS to prepare for Cinder [puppet] - https://gerrit.wikimedia.org/r/648847 (https://phabricator.wikimedia.org/T269511)
[03:33:47] (PS2) Andrew Bogott: Horizon: update LAUNCH_INSTANCE_DEFAULTS to prepare for Cinder [puppet] - https://gerrit.wikimedia.org/r/648847 (https://phabricator.wikimedia.org/T269511)
[03:35:44] (PS3) Andrew Bogott: Horizon: update LAUNCH_INSTANCE_DEFAULTS to prepare for Cinder [puppet] - https://gerrit.wikimedia.org/r/648847 (https://phabricator.wikimedia.org/T269511)
[03:38:03] (PS4) Andrew Bogott: Horizon: update LAUNCH_INSTANCE_DEFAULTS to prepare for Cinder [puppet] - https://gerrit.wikimedia.org/r/648847 (https://phabricator.wikimedia.org/T269511)
[03:40:52] (PS1) Andrew Bogott: Glance: disable the 'file' backend in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/648852
[03:41:22] (CR) Andrew Bogott: [C: +2] Horizon: update LAUNCH_INSTANCE_DEFAULTS to prepare for Cinder [puppet] - https://gerrit.wikimedia.org/r/648847 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[03:41:50] (CR) Andrew Bogott: [C: +2] Glance: disable the 'file' backend in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/648852 (owner: Andrew Bogott)
[03:52:48] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[03:54:12] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[03:57:03] (PS1) Andrew Bogott: Glance: make glance active/active in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/648853
[03:57:05] (PS1) Andrew Bogott: Remove obsolete glance image_sync code [puppet] - https://gerrit.wikimedia.org/r/648854
[04:02:37] (PS1) Andrew Bogott: Glance: remove the glance_image_dir param [puppet] - https://gerrit.wikimedia.org/r/648857
[04:03:08] (CR) Andrew Bogott: [C: +2] Glance: make glance active/active in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/648853 (owner: Andrew Bogott)
[04:31:48] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[04:35:02] (CR) Andrew Bogott: [C: +2] Glance: remove the glance_image_dir param [puppet] - https://gerrit.wikimedia.org/r/648857 (owner: Andrew Bogott)
[04:35:20] (CR) Andrew Bogott: [C: +2] Remove obsolete glance image_sync code [puppet] - https://gerrit.wikimedia.org/r/648854 (owner: Andrew Bogott)
[04:57:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:58:46] (CR) Jeena Huneidi: [C: -1] "I like the DRY idea but I think it could be a bit hard to read. The chown container also has bunch of environment variables it doesn't nee" [deployment-charts] - https://gerrit.wikimedia.org/r/648304 (owner: Ahmon Dancy)
[04:58:54] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[05:01:52] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:50] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:46] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:34] PROBLEM - ores on ores2006 is CRITICAL: connect to address 10.192.32.174 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:29:46] PROBLEM - ores on ores2009 is CRITICAL: connect to address 10.192.48.90 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:31:12] RECOVERY - ores on ores2006 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:36:16] RECOVERY - ores on ores2009 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[07:13:38] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:54] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201213T0800)
[08:43:46] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:46:58] Operations, ops-eqsin, DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (Volans) FWIW we're getting one email every hour from rancid about this. Is there any quick way to prevent/disable them by any chance?
[08:48:40] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:37] (PS1) ArielGlenn: fix up ssh key entry for gtzatchkova [puppet] - https://gerrit.wikimedia.org/r/648970 (https://phabricator.wikimedia.org/T269930)
[09:13:48] (CR) ArielGlenn: "The cross-validate-accounts cron job fails without the key type being in there, even if the key does make it onto the hosts for use. While" [puppet] - https://gerrit.wikimedia.org/r/648970 (https://phabricator.wikimedia.org/T269930) (owner: ArielGlenn)
[09:14:57] (CR) ArielGlenn: [C: +2] fix up ssh key entry for gtzatchkova [puppet] - https://gerrit.wikimedia.org/r/648970 (https://phabricator.wikimedia.org/T269930) (owner: ArielGlenn)
[09:44:10] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:04] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:56] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:54] (PS1) ArielGlenn: add platform engineering folks to snapshot and dumpsdata server access [puppet] - https://gerrit.wikimedia.org/r/649077
[14:18:17] (CR) jerkins-bot: [V: -1] add platform engineering folks to snapshot and dumpsdata server access [puppet] - https://gerrit.wikimedia.org/r/649077 (owner: ArielGlenn)
[14:21:29] (CR) ArielGlenn: "The managers should weigh in, adding them as reviewers. Also would like Moritz's thoughts on this, in particular setting up a group that's" [puppet] - https://gerrit.wikimedia.org/r/649077 (owner: ArielGlenn)
[14:23:58] (PS2) ArielGlenn: add platform engineering folks to snapshot and dumpsdata server access [puppet] - https://gerrit.wikimedia.org/r/649077
[15:27:26] Operations, netops: Upgrade Routinator 3000 to 0.8.2 - https://phabricator.wikimedia.org/T269738 (ayounsi) https://www.ripe.net/ripe/mail/archives/routing-wg/2020-December/004206.html
[16:26:45] Operations, Diff-blog, Traffic, HTTPS: Send HSTS header on diff.wikimedia.org - https://phabricator.wikimedia.org/T270034 (Nintendofan885)
[16:36:24] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 142 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:39:40] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 36 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:13:53] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (Volans) @jijiki given that the hosts that gets reimaged are changing interface name from ethN to enoN, we also need to run [[ https://wikit...
[17:55:38] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (jijiki)
[17:56:26] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (jijiki) @Volans Should I run it for the ones I have already reimaged?
[18:40:50] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (Volans) >>! In T213089#6687500, @jijiki wrote: > @Volans So I should run it for all of them? Should we add a notice about this on the reim...
[21:06:30] (CR) Krinkle: "@Bryan Thanks for the ping. I'm happy to decom the tool if the Grafana dash is recommended nowadays." [puppet] - https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) (owner: Muehlenhoff)
[21:22:19] (CR) Krinkle: "btw, is it documented somewhere how to get patches merged here? I'm cc-ing you two based on previous commits and based on us not having +2" [labs/private] - https://gerrit.wikimedia.org/r/635859 (https://phabricator.wikimedia.org/T262962) (owner: Dave Pifke)
[21:57:40] (CR) QChris: [C: -1] "> Patch Set 15:" [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: Paladox)
[23:44:58] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 484 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:46:36] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 15 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops