[00:05:53] <icinga-wm>	 PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:09:23] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Needs Cleaning - Cassandra Operational ), 10Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Dzahn) @eevans Alright, despite the issues above the server has been reinstalled now and is on st...
[00:09:42] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Needs Cleaning - Cassandra Operational ), 10Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Dzahn) a:05Cmjohnson→03Eevans cc: @fgiunchedi
[00:10:37] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[00:13:02] <wikibugs>	 (03Abandoned) 10Huji: Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: 10Huji)
[00:13:45] <wikibugs>	 (03PS5) 10Huji: Add several rights to eliminators in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553)
[00:14:09] <wikibugs>	 (03CR) 10Huji: "Yes. It has been sitting here for a long time despite community consensus." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) (owner: 10Huji)
[00:22:34] <icinga-wm>	 PROBLEM - MD RAID on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:28:32] <icinga-wm>	 PROBLEM - Check size of conntrack table on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:28:32] <icinga-wm>	 PROBLEM - configured eth on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[00:30:36] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:36] <icinga-wm>	 PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[00:31:19] <mutante>	 ^ ehm... i reinstalled that earlier. looking
[00:32:24] <icinga-wm>	 RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:32:34] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP
[00:32:34] <icinga-wm>	 PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:33:27] <mutante>	 !log wikitech-static - fix /etc/letsencrypt/renewal/wikitech-static.wikimedia.org.conf - remove webroot_map and line for status.wm.org that caused errors when doing a renewal dry-run. now dry run finishes successfully and we are using "webroot" authenticator and not "apache" anymore. This should have resolved what this ticket was about. No more Apache kills/restarts on renewal. (T214640)
[00:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:37] <stashbot>	 T214640: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640
[00:34:34] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:34:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:35:48] <mutante>	 !log restbase-dev1006 - starting nagios-nrpe-server
[00:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:34] <icinga-wm>	 PROBLEM - DPKG on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[00:36:34] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T224260 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:36:34] <icinga-wm>	 ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T224260 https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:39:16] <icinga-wm>	 PROBLEM - IPMI Sensor Status on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[00:39:20] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[00:40:24] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1006&var-datasource=eqiad+prometheus/ops
[00:42:21] <mutante>	 weird, puppet mysteriously works now but instead we have these :p
[00:50:07] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Needs Cleaning - Cassandra Operational ), 10Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Dzahn) actually.. puppet run is not failing anymore now.  :)  though..  i had to restart nagios-n...
[01:00:54] <wikibugs>	 (03CR) 10Eevans: table-properties: Initial commit (032 comments) [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust)
[01:04:54] <urandom>	 mutante: puppet runs without error?
[01:05:35] <urandom>	 mutante: this machine is cursed
[01:05:49] <urandom>	 we should cleanse it with fire
[01:06:04] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 52.44, 23.52, 12.94 https://wikitech.wikimedia.org/wiki/Application_servers
[01:07:37] <urandom>	 "Filesystem available is greater than filesystem size"
[01:07:41] <urandom>	 nice.
[01:07:42] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 25.13, 23.44, 14.04 https://wikitech.wikimedia.org/wiki/Application_servers
[01:07:56] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 76.14, 34.75, 19.29 https://wikitech.wikimedia.org/wiki/Application_servers
[01:08:28] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:08:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:08:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:08:44] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:08:52] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:08:54] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:09:00] <icinga-wm>	 PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:09:02] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:09:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:09:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:09:08] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:09:08] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:09:18] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:09:32] <icinga-wm>	 PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:09:34] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 25.63, 29.97, 19.24 https://wikitech.wikimedia.org/wiki/Application_servers
[01:10:02] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.327 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:10:18] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:22] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:10:24] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:10:28] <icinga-wm>	 RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:10:30] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 73.06, 38.94, 23.47 https://wikitech.wikimedia.org/wiki/Application_servers
[01:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:40] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:42] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:52] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:11:00] <icinga-wm>	 RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 76371 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:11:00] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 56.93, 29.03, 16.28 https://wikitech.wikimedia.org/wiki/Application_servers
[01:11:02] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 61.71, 31.34, 18.03 https://wikitech.wikimedia.org/wiki/Application_servers
[01:11:02] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:11:24] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 73.02, 39.14, 21.23 https://wikitech.wikimedia.org/wiki/Application_servers
[01:11:32] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 71.98, 36.73, 19.49 https://wikitech.wikimedia.org/wiki/Application_servers
[01:11:42] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.63, 29.31, 16.81 https://wikitech.wikimedia.org/wiki/Application_servers
[01:11:48] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:11:48] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:11:54] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:12:20] <wikibugs>	 (03CR) 10Eevans: table-properties: Initial commit (031 comment) [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust)
[01:13:18] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 19.44, 24.53, 16.33 https://wikitech.wikimedia.org/wiki/Application_servers
[01:14:14] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 12.77, 23.76, 16.99 https://wikitech.wikimedia.org/wiki/Application_servers
[01:14:16] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 13.66, 24.15, 17.94 https://wikitech.wikimedia.org/wiki/Application_servers
[01:15:00] <icinga-wm>	 PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:15:18] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 17.72, 35.87, 27.61 https://wikitech.wikimedia.org/wiki/Application_servers
[01:16:22] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 8.30, 20.88, 17.60 https://wikitech.wikimedia.org/wiki/Application_servers
[01:19:28] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 8.32, 21.63, 22.08 https://wikitech.wikimedia.org/wiki/Application_servers
[01:39:08] <icinga-wm>	 RECOVERY - puppet last run on ms-fe1007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:43:02] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Needs Cleaning - Cassandra Operational ), 10Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Eevans) >>! In T224260#5370545, @Dzahn wrote: > @eevans Alright, despite the issues above the ser...
[01:43:04] <icinga-wm>	 RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:50:12] <icinga-wm>	 RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 1011 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:50:14] <icinga-wm>	 PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:50:28] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1013 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:10:44] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:12:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:16:44] <icinga-wm>	 PROBLEM - Host mw1300 is DOWN: PING CRITICAL - Packet loss = 100%
[02:18:14] <icinga-wm>	 RECOVERY - puppet last run on wtp1033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:18:26] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:38:05] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Needs Cleaning - Cassandra Operational ), 10Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Eevans) >>! In T224260#5370545, @Dzahn wrote: > @eevans Alright, despite the issues above the ser...
[05:42:10] <icinga-wm>	 PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100%
[05:53:44] <icinga-wm>	 RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 30.54 ms
[05:55:30] <icinga-wm>	 PROBLEM - Disk space on actinium is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=actinium&var-datasource=eqiad+prometheus/ops
[05:56:48] <wikibugs>	 (03PS4) 10Jeena Huneidi: Add Parsoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909)
[06:01:30] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[06:01:52] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[06:11:45] <wikibugs>	 (03CR) 10Marostegui: Use GTIDs for master position queries for external DB when possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz)
[06:21:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 64.22, 28.53, 15.35 https://wikitech.wikimedia.org/wiki/Application_servers
[06:22:10] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 67.96, 29.78, 15.87 https://wikitech.wikimedia.org/wiki/Application_servers
[06:23:22] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 80.93, 37.42, 19.16 https://wikitech.wikimedia.org/wiki/Application_servers
[06:24:10] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[06:24:40] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 57.25, 30.81, 17.24 https://wikitech.wikimedia.org/wiki/Application_servers
[06:26:08] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[06:26:16] <icinga-wm>	 RECOVERY - Disk space on actinium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=actinium&var-datasource=eqiad+prometheus/ops
[06:29:26] <icinga-wm>	 PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:33:02] <icinga-wm>	 PROBLEM - puppet last run on ms-be1049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_timedatectl] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:35:16] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 75.53, 45.19, 27.25 https://wikitech.wikimedia.org/wiki/Application_servers
[06:35:20] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 70.96, 43.01, 27.05 https://wikitech.wikimedia.org/wiki/Application_servers
[06:35:30] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 50.40, 35.79, 23.02 https://wikitech.wikimedia.org/wiki/Application_servers
[06:37:26] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:38:58] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:40:16] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 10.95, 24.91, 22.23 https://wikitech.wikimedia.org/wiki/Application_servers
[06:43:41] <elukey>	 !log powercycle mw1300 - no ssh, serial com2 stuck with no root login available
[06:43:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:54] <icinga-wm>	 RECOVERY - Host mw1300 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[06:48:00] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 7.16, 14.43, 22.62 https://wikitech.wikimedia.org/wiki/Application_servers
[06:49:32] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 6.84, 13.36, 22.92 https://wikitech.wikimedia.org/wiki/Application_servers
[06:55:14] <wikibugs>	 (03CR) 10Aaron Schulz: "Just trying to simplify the config and reduce repetition." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz)
[06:55:22] <icinga-wm>	 RECOVERY - puppet last run on ms-be1049 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:56:32] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 6.91, 9.11, 22.74 https://wikitech.wikimedia.org/wiki/Application_servers
[06:56:50] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 6.94, 8.74, 22.28 https://wikitech.wikimedia.org/wiki/Application_servers
[06:57:24] <icinga-wm>	 RECOVERY - puppet last run on mc2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:58:50] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 7.40, 9.06, 23.94 https://wikitech.wikimedia.org/wiki/Application_servers
[07:00:06] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 7.29, 8.06, 22.49 https://wikitech.wikimedia.org/wiki/Application_servers
[07:11:38] <icinga-wm>	 PROBLEM - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:11:39] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T229156 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:11:42] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10ops-monitoring-bot)
[07:12:14] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1546970425[4](2019-07-23T19:26:29.990Z), dewiki_content_1521846803[4](2019-07-23T19:26:40.352Z), eswiki_content_1521891951[6](2019-07-23T19:26:29.987Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:27:35] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10Peachey88)
[08:02:55] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10Peachey88) @Nuria: It was worked out on IRC that they probably need their Hue account created, since they already have NDA LDAP access, see: https://wikitech.wikimedia.org/wiki/Analyt...
[10:22:20] <icinga-wm>	 RECOVERY - MegaRAID on cloudvirt1018 is OK: OK: optimal, 1 logical, 8 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:21:00] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[11:32:14] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[12:33:16] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[12:36:30] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[12:44:30] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[12:47:42] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[12:52:32] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[13:40:38] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[14:06:26] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_syslog.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[14:24:04] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:24:24] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:41:06] <icinga-wm>	 PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:41:08] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[14:41:40] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:42:02] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:42:36] <icinga-wm>	 RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:42:42] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:43:24] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:45:02] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:47:12] <librenms-wmf>	 Warning Alert for device cr2-eqsin.wikimedia.org - Traffic on tunnel link
[14:49:52] <godog>	 !log bounce rsyslog on wezen / centrallog1001
[14:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:58] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[14:51:22] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_syslog.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[14:53:38] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 756 days) https://wikitech.wikimedia.org/wiki/Logs
[14:57:12] <librenms-wmf>	 ̶W̶a̶r̶n̶i̶n̶g Device cr2-eqsin.wikimedia.org recovered from Traffic on tunnel link
[14:59:20] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[14:59:20] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[16:03:58] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 53.66, 30.32, 24.27 https://wikitech.wikimedia.org/wiki/Application_servers
[16:23:12] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 14.80, 16.54, 23.72 https://wikitech.wikimedia.org/wiki/Application_servers
[16:35:19] <wikibugs>	 Operations, Analytics, SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (Tnegrin) Approved
[17:39:07] <bd808>	 !log Updated profile & images for @wikimediatech twitter account
[17:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:37] <Krenair>	 bd808, lol nice
[17:44:53] <bd808>	 the old profile was boring :)
[19:03:40] <wikibugs>	 (PS1) Aaron Schulz: [DNM] Use DBO_DEFAULT for extension1 since it is not for key/value blob storage [mediawiki-config] - https://gerrit.wikimedia.org/r/525977
[19:09:10] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:14:02] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:14] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:36:44] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:39:06] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:56] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:01] <wikibugs>	 Operations, media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (Urbanecm) Open→Resolved The issue that was present all the time was resolved, the last file I wasn't able t...
[21:25:32] <wikibugs>	 Operations, Wikimedia-General-or-Unknown, Security: Massive spambot registrations at dinwiki - https://phabricator.wikimedia.org/T212519 (Aklapper) @sbassett: Is there anything actionable left in this task?
[22:27:10] <wikibugs>	 Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Andrew) Resolved→Open a:Andrew→wiki_willy I put this system under a realistic load today (running ~80 VMs)...