[00:05:47] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[00:05:47] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[00:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:20] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[00:06:20] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[00:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:34] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:10:55] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:13:18] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:08] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:16:08] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:53] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:18:53] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:43] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:21:44] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:37] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:31] (PS7) Krinkle: Replace stringified class names with ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/593654 (https://phabricator.wikimedia.org/T251841) (owner: Reedy)
[00:38:57] (CR) Krinkle: [C: +2] Replace stringified class names with ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/593654 (https://phabricator.wikimedia.org/T251841) (owner: Reedy)
[00:39:47] (Merged) jenkins-bot: Replace stringified class names with ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/593654 (https://phabricator.wikimedia.org/T251841) (owner: Reedy)
[00:41:27] * Krinkle staging on mwdebug1001
[00:55:16] !log krinkle@deploy1001 Synchronized wmf-config/logging.php: I046868190b472 (duration: 01m 13s)
[00:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:50] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[02:09:12] (PS12) Bmansurov: Add recommendation-api chart [deployment-charts] - https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230)
[03:53:07] Operations, Commons, MediaWiki-File-management, Thumbor, Traffic: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (AntiCompositeNumber)
[03:53:15] Operations, Commons, MediaWiki-File-management, Thumbor, Traffic: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (AntiCompositeNumber) rsvg-convert 2.40.16 did not process this file in >5 minutes, 2.48.4...
[04:03:53] Operations, Commons, MediaWiki-File-management, Thumbor, Traffic: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (AntiCompositeNumber) rsvg-convert 2.40.16 processed this file at 1000px in 75 seconds, wh...
[04:56:02] (CR) BPirkle: [C: +1] "Looks good, approved for self-merge and deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/596538 (https://phabricator.wikimedia.org/T245170) (owner: Tim Starling)
[06:29:18] PROBLEM - Check systemd state on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:26] PROBLEM - MD RAID on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:30:00] PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:08] PROBLEM - Check size of conntrack table on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:30:30] PROBLEM - Check systemd state on ores1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:34] PROBLEM - ores uWSGI web app on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:31:40] PROBLEM - puppet last run on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:31:52] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:26] PROBLEM - ores_workers_running on ores1006 is CRITICAL: PROCS CRITICAL: 2 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[06:42:30] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:16] PROBLEM - ores_workers_running on ores1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[06:43:22] RECOVERY - Check size of conntrack table on ores1006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:43:28] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:46:14] RECOVERY - ores_workers_running on ores1006 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[06:51:14] RECOVERY - MD RAID on ores1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:53:12] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:36] RECOVERY - ores_workers_running on ores1003 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200516T0700)
[08:13:17] (CR) DannyS712: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/596726 (owner: Jforrester)
[08:34:57] Operations, ORES, Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (elukey) The issue happens from time to time, always at the same time. @Halfak is there any possible solution t...
[09:52:16] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[12:15:22] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[12:43:42] (CR) Hashar: "With the last CI jobs moved to Docker containers, we barely rely on puppet anymore. I have done some cleanup a few months ago but the rema" [puppet] - https://gerrit.wikimedia.org/r/596687 (https://phabricator.wikimedia.org/T252190) (owner: Dzahn)
[15:16:38] !log krinkle@mc1020 Looking at why there are still over 2M echo:seen keys in redis main stash
[15:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:20] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:17:38] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:24:27] !log krinkle@mc1020 Prune old echo:seen: keys that have ttl:-1 from Redis main stash, ref T252945
[17:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:30] T252945: Avoid constant evictions on Redis main stash - https://phabricator.wikimedia.org/T252945
[17:29:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:34:55] Krinkle: nice!
[17:35:45] Krinkle: I still see some evictions now though :(
[17:35:52] elukey: on mc1020?
[17:36:15] ah nono aggregated
[17:36:28] we could even break down the graph per shard
[17:36:35] It is
[17:36:36] I think
[17:36:46] I also added breakdown by instance earlier today
[17:37:05] Ah I guess shard=instance
[17:37:46] ah ok if you specifically select one, I was thinking of an aggregated graph with all shards
[17:38:00] anyway, evictions are way less, good job :)
[17:39:16] it is also worth noting that memory usage is basically maxed out for almost all shards
[17:39:36] and this is the pre-condition to enter LRU mode
[17:39:39] + evictions
[17:41:29] elukey: yeah, I don't know what it's configured at but the constant 520M line for each shard is certainly suspicious
[17:41:35] that basically just tells me it's at the ceiling
[17:41:42] see T252945
[17:41:43] T252945: Avoid constant evictions on Redis main stash - https://phabricator.wikimedia.org/T252945
[17:42:48] maxmemory 500Mb
[17:42:48] maxmemory-policy volatile-lru
[17:43:00] this is from /etc/redis/tcp_6379.conf
[17:43:15] right
[17:43:31] elukey: hm.. but LRU stands for least-recently-used
[17:43:44] I know not to expect perfection in caching/lru
[17:44:00] but then how come it still hasn't gotten to the millions of unused no-ttl keys from months ago
[17:44:03] yes after redis maxes out memory, it starts evicting, following the LRU policy
[17:44:47] ahh interesting!
[17:44:49] "Evicts the least recently used keys out of all keys with an "expire" field set"
[17:44:58] :facepalm:
[17:45:02] this is the "volatile-lru"
[17:45:09] Right, that's not a bad policy
[17:45:17] if we used it correctly :P
[17:45:19] cool
[17:45:27] otherwise there is allkeys-lru
[17:45:44] so given that we very bravely gave everything a TTL
[17:45:56] and that 90% is used up by legacy Echo values from pre-2019
[17:46:04] that means it's basically only deleting stuff we want to keep
[17:46:06] great
[17:46:19] yep
[17:46:21] (CR) Andrew Bogott: [C: +2] nova-compute: set ceph nodes to use CPU features available on all cloudvirts [puppet] - https://gerrit.wikimedia.org/r/596762 (https://phabricator.wikimedia.org/T225320) (owner: Andrew Bogott)
[17:46:31] are those pre-2019 values droppable?
[17:46:38] see task
[17:46:38] yes
[17:46:39] (PS1) Andrew Bogott: Move cloudvirt-wdqs hosts off of ceph [puppet] - https://gerrit.wikimedia.org/r/596816 (https://phabricator.wikimedia.org/T252784)
[17:46:54] in 2019, Echo was fixed to give its keys a ttl of 1 year
[17:47:04] also, later that year it was migrated to echoseen-kask
[17:47:10] so it's not even reading/writing these at all anymore afaik
[17:47:17] but I'm not willing to make that call on a saturday
[17:47:26] ah ok we are waiting for confirmation, great
[17:47:29] yes yes :)
[17:47:35] wise choice
[17:47:36] but I am dropping the no-ttl ones at least
[17:47:55] which on the shards I looked at is most or all of the echo:seen keys, so it has definitely turned over everything else at least once since October
[17:48:09] (CR) Andrew Bogott: [C: +2] Move cloudvirt-wdqs hosts off of ceph [puppet] - https://gerrit.wikimedia.org/r/596816 (https://phabricator.wikimedia.org/T252784) (owner: Andrew Bogott)
[17:48:10] but I'm keeping the extra ttl check just in case
[17:48:41] s/everything/anything volatile/
[17:49:08] super, really interesting discovery
[17:49:21] going afk, have a good (rest of) the weekend :)
[17:49:25] !log krinkle@mwmaint1002: Running cleanupRemovedModules.php to prune old module_deps rows T113916
[17:49:26] thanks
[17:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:28] T113916: Redesign ResourceLoader's file dependency tracking (module_deps) - https://phabricator.wikimedia.org/T113916
[17:51:28] Operations, Pybal, Traffic, Patch-For-Review: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993 (Aklapper)
[17:53:19] Operations, Pybal, Traffic: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993 (Aklapper) Stalled→Resolved a: Vgutierrez >>! In T190993#4126816, @Vgutierrez wrote: > Let's keep pybal-test2001 as jessie till we don't have any LVS on production running jessie...
[17:53:22] Operations, Pybal, Traffic, Patch-For-Review: Upgrade LVS servers to stretch - https://phabricator.wikimedia.org/T177961 (Aklapper)
[17:53:32] Operations, Pybal, Traffic: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993 (Aklapper)
[17:55:04] Operations, JavaScript: Instability on fr.wikiversity project - https://phabricator.wikimedia.org/T112069 (Aklapper)
[17:56:18] !log krinkle@mc1023 Pruning old echo:seen: Redis keys that didn't use a ttl yet, ref T252945
[17:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:21] T252945: Avoid constant evictions on Redis main stash - https://phabricator.wikimedia.org/T252945
[18:24:03] !log krinkle@mc1025 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet, ref T252945
[18:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:07] T252945: Avoid constant evictions on Redis main stash - https://phabricator.wikimedia.org/T252945
[18:30:07] !log krinkle@mc1024 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet, ref T252945
[18:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:11] T252945: Avoid constant evictions on Redis main stash - https://phabricator.wikimedia.org/T252945
[18:48:07] Operations, Internet-Archive, Offline-Working-Group: Create backups of Wikimedia content in diverse geographic places - https://phabricator.wikimedia.org/T156544 (Aklapper) Stalled→Open
[18:54:55] !log krinkle@mc1026 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet, ref T252945
[18:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:59] T252945: Avoid constant evictions on Redis main stash - https://phabricator.wikimedia.org/T252945
[18:58:50] !log krinkle@mc1027 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet, ref T252945
[18:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:22] !log krinkle@mc1028 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[19:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:56] !log krinkle@mc1029 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[19:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:07] !log krinkle@mc1030 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[19:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:11] !log krinkle@mc1031 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[19:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:12] !log krinkle@mc1032 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[19:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:54] !log krinkle@mc1033 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[20:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
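The volatile-lru behaviour uncovered in the conversation above (eviction only ever considers keys that carry an "expire", so no-TTL keys survive indefinitely while keys we actually want are evicted) can be illustrated with a small self-contained simulation. This is a toy model for illustration only, not Redis internals; the class name, `max_keys` stand-in for `maxmemory`, and the example keys are all invented:

```python
class VolatileLruStore:
    """Toy model of Redis maxmemory-policy volatile-lru.

    Eviction candidates are only the "volatile" keys (those with a TTL
    set); keys stored without a TTL are never reclaimed, which is why
    the pre-2019 echo:seen entries stuck around.
    """

    def __init__(self, max_keys):
        self.max_keys = max_keys   # stand-in for the maxmemory limit
        self.data = {}             # key -> value
        self.ttl = {}              # key -> ttl seconds (absent = no TTL)
        self.last_used = {}        # key -> logical clock of last write
        self.clock = 0

    def set(self, key, value, ttl=None):
        self.clock += 1
        if key not in self.data and len(self.data) >= self.max_keys:
            self._evict()
        self.data[key] = value
        self.last_used[key] = self.clock
        if ttl is not None:
            self.ttl[key] = ttl

    def _evict(self):
        # Only keys with an "expire" field set are eviction candidates.
        candidates = [k for k in self.data if k in self.ttl]
        if not candidates:
            return  # nothing volatile to evict
        victim = min(candidates, key=lambda k: self.last_used[k])
        del self.data[victim], self.ttl[victim], self.last_used[victim]


store = VolatileLruStore(max_keys=3)
store.set("echo:seen:legacy", "x")        # no TTL: never a candidate
store.set("session:a", "y", ttl=3600)
store.set("session:b", "z", ttl=3600)
store.set("session:c", "w", ttl=3600)     # at capacity: evicts session:a
assert "echo:seen:legacy" in store.data   # the stale key survived
assert "session:a" not in store.data      # a live key was evicted instead
```

With the store full of no-TTL keys, the LRU machinery only ever churns through the volatile keys, matching the "it's basically only deleting stuff we want to keep" observation.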
[20:23:53] !log krinkle@mc1034,mc1035,mc1036 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[20:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 67 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:04:52] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:14:47] Operations, DBA, Wikidata, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Fix Wikidata dispatch - https://phabricator.wikimedia.org/T252952 (Addshore) Dispatching is slowing down, as the db server is lagged and the dispatch process has a waitForReplication call in the code. Specifically this...
[21:16:31] Operations, DBA, Wikidata, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag - https://phabricator.wikimedia.org/T252952 (Addshore)
[21:21:29] Operations, DBA, Wikidata, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag - https://phabricator.wikimedia.org/T252952 (Ladsgroup) I contacted @Marostegui
[21:41:31] Operations, DBA, Wikidata, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag - https://phabricator.wikimedia.org/T252952 (Addshore) p: Unbreak!→High
[21:43:25] Operations, DBA, Wikidata, Patch-For-Review, and 2 others: Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag - https://phabricator.wikimedia.org/T252952 (Addshore) Open→Resolved a: Addshore Marking as resolved as the impact on wikidata is now gone > 1...
[21:46:06] Operations, DBA, Wikidata, Patch-For-Review, and 2 others: Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag - https://phabricator.wikimedia.org/T252952 (jcrespo) {P11212}
[21:46:20] Operations, DBA, Wikidata, Patch-For-Review, and 2 others: Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag - https://phabricator.wikimedia.org/T252952 (Marostegui) Some more details. There were a few long running queries ` | 669154401 | wikiuser | 10.64...
[21:46:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:50:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:56:36] !log krinkle@mc1019 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[21:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:53] !log krinkle@mc1022 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
[22:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:00:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:33:03] (PS1) Krinkle: contint: Remove mention of unused global agent script [puppet] - https://gerrit.wikimedia.org/r/596833 (https://phabricator.wikimedia.org/T252955)
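The repeated per-shard prune runs logged above (delete `echo:seen:` keys whose TTL is -1, i.e. no expiry set, while keeping the extra TTL check "just in case") follow a SCAN-style pattern. A minimal sketch of that logic, run against a plain in-memory stand-in rather than a live Redis; the function names, batch size, and sample keys are all invented for illustration:

```python
# Sketch of the prune logic from the !log entries: walk the keyspace in
# batches (like a cursor-based Redis SCAN), and delete echo:seen:* keys
# whose TTL is -1 (no expiry set). Plain dicts stand in for redis here.

def scan_batches(keys, batch_size=100):
    """Yield keys in fixed-size batches, mimicking cursor-based SCAN."""
    keys = list(keys)
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]


def prune_no_ttl(store, ttls, prefix="echo:seen:"):
    """Delete keys matching the prefix that have no TTL (ttl == -1)."""
    deleted = 0
    for batch in scan_batches(list(store)):
        for key in batch:
            # The extra TTL check from the log: only drop keys with ttl -1,
            # so entries that already carry an expiry are left alone.
            if key.startswith(prefix) and ttls.get(key, -1) == -1:
                del store[key]
                deleted += 1
    return deleted


store = {"echo:seen:1": "a", "echo:seen:2": "b", "other:1": "c"}
ttls = {"echo:seen:2": 31536000}   # one-year TTL, as Echo sets since 2019
assert prune_no_ttl(store, ttls) == 1
assert set(store) == {"echo:seen:2", "other:1"}
```

Only the legacy no-TTL key is removed; keys with an expiry (which volatile-lru can already reclaim) and unrelated keys are untouched, matching the conservative approach taken in the log.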