[03:37:11] PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[04:02:11] RECOVERY - puppet last run on mw1309 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:43:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.237 second response time
[04:58:52] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1979 bytes in 0.120 second response time
[05:30:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1976 bytes in 0.104 second response time
[05:55:52] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.091 second response time
[06:29:41] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[10:40:02] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100%
[10:53:18] !log powercycle mw1272 - not responsive to ssh, mgmt com2 console showing "[OK" and no tty
[10:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:10] nothing suspicious from the prometheus machine stats' metris
[10:54:12] *metrics
[10:55:11] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[10:58:34] pybal just let traffic go through mw1272, httpd's access logs are good
[10:59:13] from a quick look in the logs I haven't found much
[11:01:52] now I am seeing some 503s coming for search.wikimedia.org
[11:02:00] but afaics those are not related
[12:13:12] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.24, 35.69, 31.73
[12:34:21] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.74, 36.78, 33.39
[12:35:51] PROBLEM - Check Varnish expiry mailbox lag on cp3036 is CRITICAL: CRITICAL: expiry mailbox lag is 2052609
[13:09:22] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.13, 36.19, 32.77
[13:42:06] !log restart hhvm on mw1227 due to high load (hhvm dump debug in /tmp/hhvm.44071.bt)
[13:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:31] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.27, 14.81, 23.47
[16:09:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:10:22] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[16:10:41] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 43.28, 35.65, 29.86
[16:18:41] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 37.49, 35.21, 32.14
[16:20:41] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 33.68, 34.11, 32.09
[17:11:01] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 45.70, 36.73, 31.91
[17:12:51] PROBLEM - High CPU load on API appserver on mw1224 is CRITICAL: CRITICAL - load average: 36.82, 33.54, 32.23
[17:58:11] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 35.58, 33.97, 32.11
[18:02:02] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 46.76, 37.58, 32.86
[18:02:53] (03PS1) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181)
[18:03:11] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 45.89, 36.02, 29.92
[18:03:11] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 36.85, 34.76, 32.29
[18:03:19] (03CR) 10jerkins-bot: [V: 04-1] Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy)
[18:03:27] (03PS2) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181)
[18:04:43] (03CR) 10jerkins-bot: [V: 04-1] Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy)
[18:06:26] (03PS3) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181)
[18:08:42] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 38.78, 35.51, 31.87
[18:09:18] (03PS4) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181)
[18:14:45] (03PS1) 10Reedy: Enable Translate on advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426764 (https://phabricator.wikimedia.org/T189181)
[18:15:00] (03PS2) 10Reedy: Enable Translate on advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426764 (https://phabricator.wikimedia.org/T189181)
[18:21:42] PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2045221
[18:23:11] PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 2013904
[18:34:02] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 38.89, 35.36, 30.94
[18:34:12] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 40.21, 35.33, 32.00
[18:35:12] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.50, 36.84, 32.63
[18:56:11] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 39.13, 34.02, 32.28
[19:08:51] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 48.46, 38.38, 33.28
[19:10:01] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 43.01, 42.59, 40.09
[19:14:11] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 48.06, 42.75, 40.56
[19:26:13] Dereckson: have you got the files in https://phabricator.wikimedia.org/T191610 downloaded? /me wants to delete the files on that host and unmount that bind mount so next time I pool that host I don't forget to do so
[19:27:05] * Dereckson checks
[19:28:17] https://commons.wikimedia.org/wiki/File:Jeonbuk_Memorial_altar_for_70th_anniversary_of_the_Jeju_April_3rd_Incident_(2).webm looks good
[19:28:20] yes, you can delete them
[19:28:30] k
[19:28:39] thanks
[19:33:11] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 45.41, 41.52, 40.13
[19:36:39] Dereckson: btw, tomorrow is supposed to be a WMF holiday with regards to you wanting to create wikis
[19:45:11] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[19:47:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:52:18] Reedy: I've asked a window Wednesday instead
[19:53:00] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1300
[19:53:23] :)
[19:53:39] I would've done them in the last few weeks if I didn't end up laptop-less and away from home
[19:56:17] speaking of wiki creation, I've tested a script variation to be able to create them from their own configuration, and not aawiki: the issue is simply that the db load balancer throws an exception if the database doesn't exist, the remaining of MediaWiki, as far as a maintenance script is concerned, is happy without tables.
[19:57:34] (+ the multiversion entry point)
[19:58:12] That would simplify a lot arguments.
[19:59:03] If we can find a way to stop/delay the db connection stuff
[20:05:31] !log restart hhvm on mw122[3,4] - high load
[20:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:11] RECOVERY - High CPU load on API appserver on mw1224 is OK: OK - load average: 5.90, 12.17, 23.75
[20:17:12] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 45.08, 42.29, 40.01
[20:17:45] !log restart hhvm on mw122[6-8] - high load
[20:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:32] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 8.74, 10.95, 23.90
[20:21:11] PROBLEM - High CPU load on API appserver on mw1344 is CRITICAL: CRITICAL - load average: 54.50, 51.01, 48.43
[20:24:51] <_joe_> !log restarted mw1280-4, high load
[20:24:54] !log restart hhvm on mw1229-31 - high load
[20:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:21] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 7.70, 11.15, 23.36
[20:31:21] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 10.63, 15.45, 29.91
[20:32:12] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 10.41, 17.61, 29.03
[20:32:15] !log restart hhvm on mw12[32-35] - high load
[20:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:11] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 9.44, 15.46, 29.42
[20:34:31] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 11.39, 12.75, 23.78
[20:38:27] !log restart hhvm on mw12[22,79,82] - high load
[20:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:12] <_joe_> !log restart mw1344, high load
[20:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:12] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 8.55, 10.49, 23.79
[20:41:32] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.61, 13.27, 23.32
[20:42:12] PROBLEM - Apache HTTP on mw1344 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[20:43:12] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.105 second response time
[20:44:41] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 8.84, 11.18, 23.03
[20:45:03] !log restart hhvm on mw[1285,1287,1289-1290] - high load
[20:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:01] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 8.90, 12.14, 23.74
[20:47:22] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.83, 11.26, 23.31
[20:48:57] !log restart hhvm on mw13[12-14] - high load
[20:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:12] RECOVERY - High CPU load on API appserver on mw1344 is OK: OK - load average: 13.58, 21.47, 35.14
[20:51:41] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 10.47, 11.63, 23.27
[20:51:42] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 11.07, 11.50, 23.62
[20:52:32] !log restart hhvm on mw13[43,45,46,48] - high load
[20:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:21] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 12.19, 15.29, 29.17
[21:15:41] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp3036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:21:32] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp3036 is OK: HTTP OK: HTTP/1.1 200 OK - 220 bytes in 3.074 second response time
[21:22:51] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[21:24:31] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[21:25:37] we are working on --^
[21:30:49] !log cp3036: restart varnish-be
[21:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:43] !log cp3038: restart varnish-be
[21:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:51] RECOVERY - Check Varnish expiry mailbox lag on cp3036 is OK: OK: expiry mailbox lag is 0
[21:41:51] RECOVERY - Check Varnish expiry mailbox lag on cp3038 is OK: OK: expiry mailbox lag is 0
[21:42:46] !log restart hhvm on mw1286,1317,1339 - high load
[21:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:39] !log cp3039: restart varnish-be
[21:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:11] RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 0
[21:59:41] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[22:00:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[22:09:25] !log cp3037: restart varnish-be
[22:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:31] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 316 MB (3% inode=75%)
[23:55:12] PROBLEM - High CPU load on API appserver on mw1316 is CRITICAL: CRITICAL - load average: 49.96, 48.87, 48.11