[00:00:01] rzl: ^ that is in response to forgetting to add the dir
[00:00:26] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:51] mutante: I haven't read closely but doesn't that create /srv/deployment/httpbb-tests/ but not /srv/deployment/httpbb-tests/appserver?
[00:01:32] uhm.. maybe. checking it
[00:02:00] I like the idea though -- I have to duck out unless you need anything right away, but I can take a look first thing Monday :)
[00:02:25] rzl: I was about to say.. it has lots of time. Enjoy the weekend!
[00:02:30] you too 👋
[00:02:34] thanks, cya
[00:03:16] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01061 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:03:31] Operations, Research, SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (Swagoel) Thank you!
[00:03:57] (PS2) Dzahn: httpbb: auto-create directories for test suites [puppet] - https://gerrit.wikimedia.org/r/648385
[00:16:52] Operations, InternetArchiveBot, Platform Engineering, Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (Tgr) Possibly a bug in some error handling code causing the bot to instantly retry?
[00:25:08] (PS1) Alexandros Kosiaris: Calico: Bump helm chart version [deployment-charts] - https://gerrit.wikimedia.org/r/648389
[00:28:36] (CR) Alexandros Kosiaris: [C: +2] Calico: Bump helm chart version [deployment-charts] - https://gerrit.wikimedia.org/r/648389 (owner: Alexandros Kosiaris)
[00:29:56] (Merged) jenkins-bot: Calico: Bump helm chart version [deployment-charts] - https://gerrit.wikimedia.org/r/648389 (owner: Alexandros Kosiaris)
[00:34:04] (PS4) Ryan Kemper: categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872)
[00:35:39] (CR) jerkins-bot: [V: -1] categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[00:42:04] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1535836288 and 94 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:16] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2667541384 and 151 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:26] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2402801096 and 154 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:20] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5246412688 and 302 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:24] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2173153040 and 137 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:30] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4328007472 and 263 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:44] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8655545896 and 510 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:50] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2561732168 and 159 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:17] (PS3) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[00:46:48] (CR) jerkins-bot: [V: -1] httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385 (owner: Dzahn)
[00:49:39] (PS1) Alexandros Kosiaris: calico: Re-add RBAC for configmaps to calico-node [deployment-charts] - https://gerrit.wikimedia.org/r/648403
[00:49:44] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 109120 and 193 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:00] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 96664 and 209 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:18] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 248944 and 226 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:56] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 248416 and 266 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:02] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 216664 and 271 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:10] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:27] (PS4) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[00:53:29] (CR) Alexandros Kosiaris: [C: +2] calico: Re-add RBAC for configmaps to calico-node [deployment-charts] - https://gerrit.wikimedia.org/r/648403 (owner: Alexandros Kosiaris)
[00:53:42] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 94888 and 432 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:06] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 194040 and 456 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:32] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 41768 and 482 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:50] (Merged) jenkins-bot: calico: Re-add RBAC for configmaps to calico-node [deployment-charts] - https://gerrit.wikimedia.org/r/648403 (owner: Alexandros Kosiaris)
[00:55:53] (PS5) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[00:57:02] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3122480192 and 184 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:30] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 644342160 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:00] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 61703832 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:36] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1223332920 and 82 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:01:00] (PS6) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[01:02:54] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2976 and 58 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:56] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:30] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1800 and 93 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:34] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1696 and 96 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:04] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 37224 and 126 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:26] (PS7) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[01:18:11] (PS5) Ryan Kemper: categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872)
[01:19:19] (PS8) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[01:19:41] (CR) jerkins-bot: [V: -1] categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[01:21:27] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/27120/deploy1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/648385 (owner: Dzahn)
[01:38:51] (PS6) Ryan Kemper: categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872)
[01:40:27] (PS7) Ryan Kemper: categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872)
[01:40:29] (CR) jerkins-bot: [V: -1] categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[01:42:21] (CR) jerkins-bot: [V: -1] categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[01:44:36] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27122/console" [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[01:51:08] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 103.43, 100.76, 94.96 https://wikitech.wikimedia.org/wiki/Swift
[02:02:06] (PS8) Ryan Kemper: categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872)
[02:03:38] (CR) jerkins-bot: [V: -1] categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[02:05:26] (PS9) Ryan Kemper: categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872)
[02:06:57] (CR) jerkins-bot: [V: -1] categories: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[02:11:52] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:46] (CR) Ori.livneh: "I notice that many of the changes are both 2 and 3 compatible:" (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[02:35:18] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 108.66, 101.79, 100.73 https://wikitech.wikimedia.org/wiki/Swift
[03:26:06] (PS14) Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526)
[03:26:19] (CR) Mstyles: Add new helm chart for rdf-streaming-updater (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[03:26:27] (CR) Mstyles: "> Patch Set 13:" [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[04:10:10] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 106.24, 100.93, 100.45 https://wikitech.wikimedia.org/wiki/Swift
[05:05:48] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 104.81, 101.21, 98.16 https://wikitech.wikimedia.org/wiki/Swift
[06:39:30] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 107.98, 100.31, 95.36 https://wikitech.wikimedia.org/wiki/Swift
[07:48:10] RECOVERY - Check systemd state on an-test-client1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:04] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 100.00, 97.06, 100.16 https://wikitech.wikimedia.org/wiki/Swift
[07:57:20] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 102.74, 99.02, 100.28 https://wikitech.wikimedia.org/wiki/Swift
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201212T0800)
[08:39:30] PROBLEM - Check systemd state on cp1075 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:04] RECOVERY - Check systemd state on cp1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:14] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 34.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:11:16] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:11:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:12:54] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[12:26:52] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:48] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:04] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:00] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:08] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:42:04] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:06:00] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:11:58] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:51:10] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:04] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:24:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:58] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:10] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:57] Operations, ops-eqiad, Data-Services, cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (Jclark-ctr)
[23:08:47] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Jclark-ctr)
[23:40:36] (PS1) Andrew Bogott: Cinder: include custon resource_filters.json file [puppet] - https://gerrit.wikimedia.org/r/648770 (https://phabricator.wikimedia.org/T269511)
[23:41:57] (CR) Andrew Bogott: [C: +2] Cinder: include custon resource_filters.json file [puppet] - https://gerrit.wikimedia.org/r/648770 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[23:43:13] Operations, ops-eqiad, Data-Services, cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (Peachey88)