[03:37:11] PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[04:02:11] RECOVERY - puppet last run on mw1309 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:43:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.237 second response time
[04:58:52] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1979 bytes in 0.120 second response time
[05:30:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1976 bytes in 0.104 second response time
[05:55:52] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.091 second response time
[06:29:41] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[10:40:02] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100%
[10:53:18] !log powercycle mw1272 - not responsive to ssh, mgmt com2 console showing "[OK" and no tty
[10:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:10] nothing suspicious from the prometheus machine stats' metris
[10:54:12] *metrics
[10:55:11] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[10:58:34] pybal just let traffic go through mw1272, httpd's access logs are good
[10:59:13] from a quick look in the logs I haven't found much
[11:01:52] now I am seeing some 503s coming for search.wikimedia.org
[11:02:00] but afaics those are not related
[12:13:12] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.24, 35.69, 31.73
[12:34:21] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.74, 36.78, 33.39
[12:35:51] PROBLEM - Check Varnish expiry mailbox lag on cp3036 is CRITICAL: CRITICAL: expiry mailbox lag is 2052609
[13:09:22] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.13, 36.19, 32.77
[13:42:06] !log restart hhvm on mw1227 due to high load (hhvm dump debug in /tmp/hhvm.44071.bt)
[13:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:31] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.27, 14.81, 23.47
[16:09:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:10:22] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[16:10:41] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 43.28, 35.65, 29.86
[16:18:41] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 37.49, 35.21, 32.14
[16:20:41] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 33.68, 34.11, 32.09
[17:11:01] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 45.70, 36.73, 31.91
[17:12:51] PROBLEM - High CPU load on API appserver on mw1224 is CRITICAL: CRITICAL - load average: 36.82, 33.54, 32.23
[17:58:11] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 35.58, 33.97, 32.11
[18:02:02] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 46.76, 37.58, 32.86
[18:02:53] (03PS1) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181)
[18:03:11] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 45.89, 36.02, 29.92
[18:03:11] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 36.85, 34.76, 32.29
[18:03:19] (03CR) 10jerkins-bot: [V: 04-1] Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy)
[18:03:27] (03PS2) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181)
[18:04:43] (03CR) 10jerkins-bot: [V: 04-1] Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy)
[18:06:26] (03PS3) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181)
[18:08:42] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 38.78, 35.51, 31.87
[18:09:18] (03PS4) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181)
[18:14:45] (03PS1) 10Reedy: Enable Translate on advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426764 (https://phabricator.wikimedia.org/T189181)
[18:15:00] (03PS2) 10Reedy: Enable Translate on advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426764 (https://phabricator.wikimedia.org/T189181)
[18:21:42] PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2045221
[18:23:11] PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 2013904
[18:34:02] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 38.89, 35.36, 30.94
[18:34:12] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 40.21, 35.33, 32.00
[18:35:12] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.50, 36.84, 32.63
[18:56:11] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 39.13, 34.02, 32.28
[19:08:51] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 48.46, 38.38, 33.28
[19:10:01] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 43.01, 42.59, 40.09
[19:14:11] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 48.06, 42.75, 40.56
[19:26:13] Dereckson: have you got the files in https://phabricator.wikimedia.org/T191610 downloaded? /me wants to delete the files on that host and unmount that bind mount so next time I pool that host I don't forget to do so
[19:27:05] * Dereckson checks
[19:28:17] https://commons.wikimedia.org/wiki/File:Jeonbuk_Memorial_altar_for_70th_anniversary_of_the_Jeju_April_3rd_Incident_(2).webm looks good
[19:28:20] yes, you can delete them
[19:28:30] k
[19:28:39] thanks
[19:33:11] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 45.41, 41.52, 40.13
[19:36:39] Dereckson: btw, tomorrow is supposed to be a WMF holiday with regards to you wanting to create wikis
[19:45:11] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[19:47:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:52:18] Reedy: I've asked a window Wednesday instead
[19:53:00] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1300
[19:53:23] :)
[19:53:39] I would've done them in the last few weeks if I didn't end up laptop-less and away from home
[19:56:17] speaking of wiki creation, I've tested a script variation to be able to create them from their own configuration, and not aawiki: the issue is simply that the db load balancer throws an exception if the database doesn't exist, the remaining of MediaWiki, as far as a maintenance script is concerned, is happy without tables.
[19:57:34] (+ the multiversion entry point)
[19:58:12] That would simplify a lot arguments.
[19:59:03] If we can find a way to stop/delay the db connection stuff
[20:05:31] !log restart hhvm on mw122[3,4] - high load
[20:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:11] RECOVERY - High CPU load on API appserver on mw1224 is OK: OK - load average: 5.90, 12.17, 23.75
[20:17:12] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 45.08, 42.29, 40.01
[20:17:45] !log restart hhvm on mw122[6-8] - high load
[20:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:32] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 8.74, 10.95, 23.90
[20:21:11] PROBLEM - High CPU load on API appserver on mw1344 is CRITICAL: CRITICAL - load average: 54.50, 51.01, 48.43
[20:24:51] <_joe_> !log restarted mw1280-4, high load
[20:24:54] !log restart hhvm on mw1229-31 - high load
[20:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:21] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 7.70, 11.15, 23.36
[20:31:21] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 10.63, 15.45, 29.91
[20:32:12] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 10.41, 17.61, 29.03
[20:32:15] !log restart hhvm on mw12[32-35] - high load
[20:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:11] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 9.44, 15.46, 29.42
[20:34:31] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 11.39, 12.75, 23.78
[20:38:27] !log restart hhvm on mw12[22,79,82] - high load
[20:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:12] <_joe_> !log restart mw1344, high load
[20:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:12] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 8.55, 10.49, 23.79
[20:41:32] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.61, 13.27, 23.32
[20:42:12] PROBLEM - Apache HTTP on mw1344 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[20:43:12] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.105 second response time
[20:44:41] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 8.84, 11.18, 23.03
[20:45:03] !log restart hhvm on mw[1285,1287,1289-1290] - high load
[20:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:01] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 8.90, 12.14, 23.74
[20:47:22] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.83, 11.26, 23.31
[20:48:57] !log restart hhvm on mw13[12-14] - high load
[20:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:12] RECOVERY - High CPU load on API appserver on mw1344 is OK: OK - load average: 13.58, 21.47, 35.14
[20:51:41] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 10.47, 11.63, 23.27
[20:51:42] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 11.07, 11.50, 23.62
[20:52:32] !log restart hhvm on mw13[43,45,46,48] - high load
[20:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:21] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 12.19, 15.29, 29.17
[21:15:41] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp3036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:21:32] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp3036 is OK: HTTP OK: HTTP/1.1 200 OK - 220 bytes in 3.074 second response time
[21:22:51] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[21:24:31] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[21:25:37] we are working on --^
[21:30:49] !log cp3036: restart varnish-be
[21:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:43] !log cp3038: restart varnish-be
[21:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:51] RECOVERY - Check Varnish expiry mailbox lag on cp3036 is OK: OK: expiry mailbox lag is 0
[21:41:51] RECOVERY - Check Varnish expiry mailbox lag on cp3038 is OK: OK: expiry mailbox lag is 0
[21:42:46] !log restart hhvm on mw1286,1317,1339 - high load
[21:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:39] !log cp3039: restart varnish-be
[21:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:11] RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 0
[21:59:41] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[22:00:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[22:09:25] !log cp3037: restart varnish-be
[22:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:31] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 316 MB (3% inode=75%)
[23:55:12] PROBLEM - High CPU load on API appserver on mw1316 is CRITICAL: CRITICAL - load average: 49.96, 48.87, 48.11