[00:00:09] RECOVERY - check_missing_thank_yous on db1025 is OK: OK missing_thank_yous=0 [00:00:39] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [00:45:59] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:14:59] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [01:29:39] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 82065.69298 Seconds [01:29:49] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 82071.564071 Seconds [01:29:49] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 81367.851617 Seconds [01:29:59] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 81374.711233 Seconds [01:29:59] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 81374.715292 Seconds [01:30:19] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 82108.634418 Seconds [01:35:19] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:38:19] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 82588.527414 Seconds [01:46:29] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:55:09] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=1118 [critical =500] [01:57:19] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 27.353132 Seconds [01:57:39] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 44.396854 Seconds [01:57:49] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 50.032767 Seconds [01:57:49] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 16.390511 Seconds [01:57:59] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 23.397094 Seconds [01:57:59] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 23.407951 Seconds [02:00:09] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=829 [critical =500] [02:05:09] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=1130 [critical =500] [02:10:09] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=1876 [critical =500] [02:15:29] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [02:24:10] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 08m 13s) [02:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:04] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 08m 03s) [02:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:38] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Apr 8 02:56:37 UTC 2017 (duration 5m 33s) [02:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:54] (03CR) 10Krinkle: Add skin, language, and variant to user_properties_anon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344302 (https://phabricator.wikimedia.org/T152043) (owner: 10Reedy) [03:43:59] PROBLEM - 
puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:46:49] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 116750.110755 Seconds [03:46:49] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 116751.207209 Seconds [03:46:59] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 116756.019502 Seconds [03:49:49] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [03:49:49] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:49:59] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [04:10:29] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3241.60 Read Requests/Sec=1933.90 Write Requests/Sec=1.30 KBytes Read/Sec=25525.20 KBytes_Written/Sec=373.20 [04:11:59] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [04:12:14] (03PS2) 10Krinkle: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar) [04:14:06] (03CR) 10jerkins-bot: [V: 04-1] rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar) [04:17:58] (03CR) 10Krinkle: "Each test case fails with "Attempted to serialize unserializable builtin class"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar) [04:19:51] (03PS3) 10Krinkle: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar) [04:20:29] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.70 Read Requests/Sec=1.90 Write Requests/Sec=46.10 KBytes Read/Sec=13.60 KBytes_Written/Sec=252.80 [04:37:57] (03CR) 10Krinkle: "Was caused indirectly by backupGlobals, known phpunit issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar) [04:38:59] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:39:28] (03CR) 10Krinkle: [C: 031] "Tentatively +1, will leave to Aaron to double-check and verify since I don't know exactly in how and in which situations this code will ru" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar) [04:48:59] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:06:59] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [05:08:28] (03CR) 10BryanDavis: Add skin, language, and variant to user_properties_anon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344302 (https://phabricator.wikimedia.org/T152043) (owner: 10Reedy) [05:17:59] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:25:39] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:27:49] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:41:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [05:46:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [05:52:40] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:53:39] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [05:55:49] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:20:39] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:23:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [06:28:39] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:28:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [06:37:59] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:57:39] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:04:59] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [07:15:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [07:24:32] (03PS1) 10Urbanecm: DNS configuration for wb.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/347141 (https://phabricator.wikimedia.org/T162510) [07:30:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [07:32:29] (03PS1) 10Urbanecm: Add wb.wikimedia.org to ServerAlias for wikimedia-chapter Vhost [puppet] - 10https://gerrit.wikimedia.org/r/347142 (https://phabricator.wikimedia.org/T162510) [07:35:39] PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:47:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [07:52:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:04:39] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [08:09:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:14:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:17:29] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:19:29] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:22:29] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:29:55] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3165796 (10Urbanecm) https://upload.wikimedia.org/wikipedia/commons/thu... [08:30:49] (03CR) 10Reedy: [C: 032] Switch EducationProgram to extension.json for extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347122 (https://phabricator.wikimedia.org/T162481) (owner: 10Reedy) [08:32:25] (03Merged) 10jenkins-bot: Switch EducationProgram to extension.json for extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347122 (https://phabricator.wikimedia.org/T162481) (owner: 10Reedy) [08:32:38] (03CR) 10jenkins-bot: Switch EducationProgram to extension.json for extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347122 (https://phabricator.wikimedia.org/T162481) (owner: 10Reedy) [08:33:38] !log reedy@tin Synchronized wmf-config/extension-list: T162481 (duration: 00m 40s) [08:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] T162481: Unwanted change of EducationProgram namespace - https://phabricator.wikimedia.org/T162481 [08:34:36] !log reedy@tin Synchronized wmf-config/CommonSettings.php: T162481 (duration: 00m 39s) [08:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:24] !log reedy@tin Started scap: Rebuild EP l10n cache for namespace aliases T162481 [08:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:46:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:47:29] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:26:34] couple of rather sloooooooow hosts [09:32:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is 
CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:33:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:33:49] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:38:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:38:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:38:49] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:38:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:42:40] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:43:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:55:36] !log reedy@tin Finished scap: Rebuild EP l10n cache for namespace aliases T162481 (duration: 79m 11s) [09:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:44] T162481: Unwanted change of EducationProgram namespace - https://phabricator.wikimedia.org/T162481 [09:55:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:00:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:11:39] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:21:42] PROBLEM - MariaDB disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 173067 MB (5% inode=99%) [10:23:33] checking labsdb1001 [10:23:47] ack [10:23:48] :) [10:24:13] is everything fine? I'm on my phone [10:24:21] need help jynus ? [10:24:38] I just connected from my laptop [10:24:49] PROBLEM - Disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 113371 MB (3% inode=99%) [10:25:53] (03PS1) 10Urbanecm: Give sysops ability to promote users to eliminator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396) [10:25:53] someone has converted enwiki to innodb [10:25:59] seems sqldata [10:26:19] it is now taking 800 GB [10:26:30] then there is the usermiquel db [10:27:10] xtools has high activity [10:27:14] 10:25 < jynus> someone has converted enwiki to innodb - ??? [10:27:34] we could try to compress it [10:27:43] PROBLEM - MariaDB disk space on labsdb1003 is CRITICAL: DISK CRITICAL - free space: /srv 175371 MB (5% inode=99%) [10:27:51] no, we do not have 10 days to do that [10:28:04] it is 98% usage already [10:28:35] yeah, I meant a long-term solution [10:28:52] there are also some large temporary tables [10:30:39] PROBLEM - Disk space on labsdb1003 is CRITICAL: DISK CRITICAL - free space: /srv 113566 MB (3% inode=99%) [10:32:36] u3532 has myisam tables [10:32:40] there is one user eating 53G and those tables have had no activity this year, they are from Aug 2016...
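A quick way to confirm which schemas are eating the space on a MariaDB host like labsdb1001 is to ask information_schema for per-schema sizes. This is a minimal sketch, assuming an admin session on the affected server; it only reflects sizes the server knows about, so on-disk temporary tables and stray files outside the datadir will not show up here, and the user schema name below is a placeholder rather than the real database mentioned above.

-- Per-schema size, largest first
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
  FROM information_schema.tables
 GROUP BY table_schema
 ORDER BY size_gb DESC
 LIMIT 20;

-- Engine breakdown for one suspect user schema (placeholder name)
SELECT engine, COUNT(*) AS table_count,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
  FROM information_schema.tables
 WHERE table_schema = 'u3532__example'
 GROUP BY engine;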
[10:32:42] we can move that to the other partition [10:32:45] yeah, that one is the one I saw [10:34:16] if I can be of any help just tell me ;) [10:34:34] can you improve my 3G connection? :) [10:34:53] lol [10:35:04] no, but I can be your remote-hands though [10:38:48] ok, I have the user tables read-locked [10:38:57] we can move them away, doing now [10:39:48] good, thanks. we can start compressing some tables on monday or something too I guess [10:41:01] mv u3532__ to /srvsqldata [10:41:16] we then need to create a symbolic link [10:43:25] yep, and restart replication once the space is back reclaimed [10:43:37] no idea if this message was sent actually, the connection is slow [10:43:59] tokudb doesn't work when there is little spaca available [10:45:20] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [10:45:20] RECOVERY - Host cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 84.73 ms [10:45:37] is it any better? I cannot see the fs recovering [10:46:39] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:47:34] and now labsdb1003? [10:48:42] PROBLEM - MariaDB disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 186700 MB (5% inode=99%) [10:49:25] in this case, commons wiki uncompressed [10:49:43] what? [10:49:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [10:49:54] how/why is changing them? [10:49:59] RECOVERY - Host cr2-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.76 ms [10:50:18] there is a 50 GB commonswiki recentchanges [10:50:49] PROBLEM - Disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 127810 MB (3% inode=99%) [10:50:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [10:51:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:51:53] it has to be temporary tables [10:51:59] look at the graphs: [10:52:25] <_joe_> that 5xx peak is probably due to cr2 coming back up [10:52:31] something is comsuming disk at 100GB / minute [10:52:37] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1001&var-network=eth0&from=1491645129366&to=1491648729366 [10:52:49] RECOVERY - Disk space on labsdb1001 is OK: DISK OK [10:53:18] <_joe_> 100gb/minute must be mostly io-bound [10:53:33] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1003&var-network=eth0&from=1491645200556&to=1491648800556 [10:53:44] I know what it is- it is mysql [10:53:49] I do not know who is doing it [10:54:36] I think it is time to go full emergency mode [10:54:44] read only and kill all running queries [10:55:02] !log setting labsdb1001 and labsdb1003 in read only mode [10:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:15] jynus: did you checked the replica? [10:55:18] <_joe_> jynus: disk usage went down [10:55:24] <_joe_> abrupltly [10:55:29] <_joe_> -l [10:56:42] RECOVERY - MariaDB disk space on labsdb1001 is OK: DISK OK [10:56:59] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
[10:57:39] RECOVERY - Disk space on labsdb1003 is OK: DISK OK [10:57:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:57:42] RECOVERY - MariaDB disk space on labsdb1003 is OK: DISK OK [10:57:45] there are some massive wikidata queries [10:57:49] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:57:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:57:54] but I am not sure 100% it is the cause [10:58:26] I need help to go over the log and see what is the cause [10:58:41] it is clearly some large user creating temporary tables [10:58:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:58:53] almost 500GB of temporary tables [10:59:08] <_joe_> https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1001&var-network=eth0&from=now-30m&to=now that seems like queries being killed [10:59:12] <_joe_> or completed [10:59:23] I am killing queries running longer than 1000 seconds [10:59:32] I said "full emergency mode" :-) [10:59:34] jynus: I've seen queries examining 505880000 and 80600000 rows, and a lot of Copying to tmp table [10:59:35] <_joe_> so yeah there is a 1:1 correspondence I'd say [10:59:59] can anyone help me identify heavy hitters on tendril [11:00:04] sure [11:00:10] I am on a laptop in a cafe [11:00:13] <_joe_> volans: are you handling it? [11:00:16] 1001 or 1003? [11:00:16] not the most comfortable place [11:00:19] <_joe_> jynus: :/ [11:00:19] both [11:00:32] <_joe_> let me know if you need me to do something [11:01:08] https://tendril.wikimedia.org/report/slow_queries?host=%5Elabsdb&user=&schema=&qmode=eq&query=&hours=1 [11:01:31] I'd say we ban s51362 [11:01:41] prematurely [11:01:48] even if we don't know it's him [11:02:06] recover from emergency mode [11:02:11] then investigate more [11:02:32] the user has like 20 times the load of the other users [11:02:54] <_joe_> ok [11:03:01] but not on 1001 though [11:03:07] yeah [11:03:09] it doesn't fit [11:03:18] it happened on labsdb1001 [11:03:20] then on 1003 [11:03:39] could be s52256 [11:03:54] are we sure it's not something coming from the replica, right? it continued after you stopped it? [11:04:25] no, it stopped when I started killing/set read only [11:04:32] there are some issues with the replica [11:04:42] jynus: I thought that at first, but only 21 queries in an hour... although huge ones
[11:04:49] I mean regarding s52256 [11:04:49] but those only make things fail faster, they have been there for days [11:04:59] see below [11:05:00] there are more [11:05:30] the other option is s51187 [11:06:17] I would re-RW without any ban to see if it starts again [11:06:20] and then try one by one [11:07:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [11:07:58] ok [11:09:54] we should ban anyway s51362 [11:10:06] it started a 1500-second query every 10 seconds [11:10:14] tsum 103,969 yeah [11:10:40] I am going to rate limit that user to 2 connections [11:14:40] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:15:19] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3165924 (10elukey) We just saw an icinga recovery for cr2-esams, and now mtr from netmon1001 to cp3008 for example goes through cr2-eqiad -> cr2-esams. Some 503s were registered while probably VRRP shifte... [11:15:38] _joe_ --^ (re: 503s and cr2-esams coming up) [11:15:48] thanks elukey! [11:16:12] probably the broken part was replaced now? [11:16:33] <_joe_> elukey: it was reseated yesterday [11:17:28] <_joe_> oh no it was cr2-knams [11:17:34] ah okok :D [11:22:37] jynus, I've done a show full processlist on 1003 and my terminal is using 1GB of RAM now... [11:23:12] it was a query from s51187 so long that it made my terminal crash in the end [11:23:52] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3165930 (10DatGuy) a:03DatGuy I'm having difficulty with the logo. Could anyone else create it? [11:24:59] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [11:27:43] * volans bbiab [11:27:45] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3165934 (10Urbanecm) a:05DatGuy>03None Removing @DatGuy as, as far as I know, he doesn't have rights to prod. [11:28:41] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3165936 (10DatGuy) Well, I'm preparing the config. Maybe self-assigning wasn't the right thing to do though. [11:30:33] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3165952 (10Urbanecm) Yeah, but this task should be claimed by the person who creates the wiki on the servers, not the one who prepares the config. What do you have a problem with? With creating the log... [11:33:27] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3165957 (10DatGuy) Creating the logo file. The language is similar to Nepali, meaning there are some lines that go over. https://commons.wikimedia.org/wiki/File:Wikipedia-logo-v2-ne...
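The emergency measures discussed above (read-only mode, killing long-running queries, capping a single user's connections) map onto standard MariaDB statements. This is a minimal sketch from an admin session, not the exact commands that were run; the thread id and the account host pattern are placeholders.

-- Stop new writes while investigating (does not affect SUPER or replication threads)
SET GLOBAL read_only = 1;

-- Find user queries running longer than 1000 seconds...
SELECT id, user, time, LEFT(info, 100) AS query
  FROM information_schema.processlist
 WHERE command = 'Query' AND time > 1000;

-- ...and kill them one by one (12345 is a placeholder thread id)
KILL QUERY 12345;

-- Cap a heavy user to 2 concurrent connections (host pattern is illustrative)
GRANT USAGE ON *.* TO 's51362'@'%' WITH MAX_USER_CONNECTIONS 2;

-- Once things are stable again
SET GLOBAL read_only = 0;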
[11:34:08] volans, that I would not consider it strange [11:34:26] I have created: https://phabricator.wikimedia.org/T162519 [11:35:32] discussing who is at fault is meaningless - we report what we believe is unfair usage after an issue is detected [11:36:13] that tool seems like an adopted, previously abandoned tool, so I would expect people to be very cooperative [11:36:23] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3165958 (10Urbanecm) Sadly I can't help you here... Maybe ask at commons somewhere? [11:37:18] swap is still worrying on those servers [11:42:31] (03CR) 10Juniorsys: [C: 031] standardize "include ::base:*" [puppet] - 10https://gerrit.wikimedia.org/r/347024 (owner: 10Dzahn) [11:45:47] (03CR) 10Juniorsys: [C: 031] standardize "include ::profile:*", "include ::nrpe" [puppet] - 10https://gerrit.wikimedia.org/r/347023 (owner: 10Dzahn) [11:51:40] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3165961 (10elukey) from cr2-esams show log messages: ``` Apr 8 10:44:00 re0.cr2-esams fpc0 CLKSYNC: Transitioned to centralized mode Apr 8 10:44:03 re0.cr2-esams fpc0 I2C Failed device: group 0x52 ad... [11:56:09] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:59] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [12:15:49] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:16:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [12:18:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [12:18:58] ^^ hmm [12:19:10] why do those keep going off? [12:19:49] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [12:21:21] there was a spike of 503s in this case [12:21:41] oh [12:21:42] is that normal? [12:22:10] not usually [12:22:22] ok, thanks for replying :) [12:22:37] need more investigation, but I can't right now, I'll look at it in a bit, seems recovered for now [12:22:44] ok [12:28:49] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:30:49] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:30:59] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:39:18] sorry, that was most likely a result of the bans [12:39:25] it's happened with some of the past bans, too [12:39:49] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:39:59] bblack: ah, didn't notice the SAL on the other channel... it would be nice to have it announce it also here [12:41:39] yeah, I wonder why it doesn't [12:42:16] probably to avoid double SAL-ing the message :) [12:42:56] it might not have the logic to replay the message here without re-SALing [12:43:19] yeah, but we do get SAL echoes from other tools [12:43:22] hmmm [12:47:49] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:48:21] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3165972 (10BBlack) [12:50:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 4291 MB (8% inode=96%): /dev 10 MB (100% inode=99%): /run 2898 MB (90% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522315 MB (32% inode=99%) [12:53:24] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10BBlack) The continued reports above were expected, as detail... [12:55:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /dev 10 MB (100% inode=99%): /run 2882 MB (90% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522312 MB (32% inode=99%) [12:58:59] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [13:00:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /dev 10 MB (100% inode=99%): /run 2874 MB (89% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522309 MB (32% inode=99%) [13:05:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /dev 10 MB (100% inode=99%): /run 2874 MB (89% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522313 MB (32% inode=99%) [13:06:49] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:07:49] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:09:13] wonder what civi1001 is. [13:09:59] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3156449 (10BBlack) As best as I can tell from looking at a longer section of the cr2-esams logs, it really does look like esams remote hands already swapped in the replacement part and things came up norm... 
[13:10:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /dev 10 MB (100% inode=99%): /run 2874 MB (89% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522310 MB (32% inode=99%) [13:15:10] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /dev 10 MB (100% inode=99%): /run 2874 MB (89% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522307 MB (32% inode=99%) [13:16:49] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [13:16:56] I'd assume it's civicrm [13:17:04] (which is in fr-tech's world) [13:17:22] https://wikitech.wikimedia.org/wiki/Civicrm.wikimedia.org [13:18:30] yep, CiviCRM... [13:20:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /dev 10 MB (100% inode=99%): /run 2874 MB (89% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522313 MB (32% inode=99%) [13:23:49] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 610734 [13:25:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 97 MB (0% inode=96%): /dev 10 MB (100% inode=99%): /run 2874 MB (89% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522310 MB (32% inode=99%) [13:30:09] RECOVERY - check_disk on civi1001 is OK: DISK OK - free space: / 11214 MB (21% inode=96%): /dev 10 MB (100% inode=99%): /run 2890 MB (90% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 756257 MB (91% inode=99%): /srv/archive/banner_logs 1522307 MB (32% inode=99%) [13:30:39] I was about to silence it... given that being a passive check it alarms everytime the text changes [13:33:49] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [13:34:38] ACKNOWLEDGEMENT - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 20 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdj1] Volans Broken disk: https://phabricator.wikimedia.org/T162347 [13:35:09] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1006 - https://phabricator.wikimedia.org/T162347#3159886 (10Volans) I've also ACK'ed on Icinga the related puppet run alarm [13:35:50] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:55:49] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:08:29] PROBLEM - Check Varnish expiry mailbox lag on cp2020 is CRITICAL: CRITICAL: expiry mailbox lag is 600010 [14:09:09] PROBLEM - Check Varnish expiry mailbox lag on cp2008 is CRITICAL: CRITICAL: expiry mailbox lag is 610348 [14:09:49] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:14:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:19:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:22:39] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:24:49] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:32:49] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 576033 [14:38:49] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:41:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:46:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:51:49] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:58:59] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:08:29] PROBLEM - Check Varnish expiry mailbox lag on cp3036 is CRITICAL: CRITICAL: expiry mailbox lag is 1067601 [15:08:29] RECOVERY - Check Varnish expiry mailbox lag on cp2020 is OK: OK: expiry mailbox lag is 492627 [15:14:09] PROBLEM - Check Varnish expiry mailbox lag on cp3049 is CRITICAL: CRITICAL: expiry mailbox lag is 1281891 [15:14:59] PROBLEM - Check Varnish expiry mailbox lag on cp3046 is CRITICAL: CRITICAL: expiry mailbox lag is 1132181 [15:16:59] PROBLEM - Check Varnish expiry mailbox lag on cp3047 is CRITICAL: CRITICAL: expiry mailbox lag is 1308670 [15:17:29] PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 615547 [15:20:09] PROBLEM - Check Varnish expiry mailbox lag on cp3035 is CRITICAL: CRITICAL: expiry mailbox lag is 1301270 [15:25:59] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:29:18] RECOVERY - Check Varnish expiry mailbox lag on cp2008 is OK: OK: expiry mailbox lag is 476627 [15:47:58] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:47:59] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:50:48] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:50:48] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [15:57:28] RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 0 [15:58:38] PROBLEM - Check Varnish expiry mailbox lag on cp4005 is CRITICAL: CRITICAL: expiry mailbox lag is 789454 [15:59:18] PROBLEM - Check Varnish expiry mailbox lag on cp4006 is CRITICAL: CRITICAL: expiry mailbox lag is 779681 [16:03:37] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3166190 (10BBlack) The last round of bans mentioned above is complete n... [16:04:48] PROBLEM - Check Varnish expiry mailbox lag on cp4007 is CRITICAL: CRITICAL: expiry mailbox lag is 954434 [16:05:42] the mailbox lags are fallout from the ban traffic. they should eventually resolve on their own, and aren't a problem on their own. [16:05:58] PROBLEM - Check Varnish expiry mailbox lag on cp4015 is CRITICAL: CRITICAL: expiry mailbox lag is 756306 [16:06:14] (we alert on them because they tend to be a leading indicator of a different kind of problem that eventually results in 503s when the lag runs away and never recovers) [16:40:18] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 51744.058569 Seconds [16:40:18] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 51745.035897 Seconds [16:40:28] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 51750.038752 Seconds [16:43:18] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [16:43:18] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [16:43:28] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [16:51:58] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:52:48] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [17:12:28] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:58] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:20:57] !log restart nova-api on labnet [17:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:15] !log restart rabbitmq on labcontrol1001 [17:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:48] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:22:28] (03PS1) 10Volans: Logging: refactored and standardized [switchdc] - 10https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) [17:25:04] !log openstack server delete 970a86ce-2549-4cf3-be91-1f8558ab1b32 (admin-monitoring stuck in build) [17:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:15] !log delete manual on labcontrol all instances in delete state on nodepool [17:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:29] !log nova reset-state on 15 nodepool stuck in deletion nodes, and force-delete [17:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:57] chasemp: thanks [17:40:28] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:43:05] !log service nova-compute restart labvirt1002 [17:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:28] RECOVERY - Check Varnish expiry mailbox lag on cp3036 is OK: OK: expiry mailbox lag is 0 [17:49:18] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:48] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:18:18] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:24:18] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:34:05] (03CR) 10Alexandros Kosiaris: [C: 031] typos: make "include profile::backup" a typo [puppet] - 10https://gerrit.wikimedia.org/r/347064 (owner: 10Dzahn) [18:35:40] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3166306 (10BBlack) Update: others noticed the serial number didn't change. So, the new part is not yet installed, and we're not sure whether the old part recovered spontaneously, or due to some local act... [18:46:48] Someone broke puppet for phabricator instance. 
[18:46:53] https://github.com/wikimedia/puppet/commit/63dd320b9d16bb4d4b295aa2b9a12c6abad0d764 [18:48:28] akosiaris hi, how do i fix this error [18:48:29] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item profile::backup::pool in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/backup/host.pp:7 on node phabricator.phabricator.eqiad.wmflabs [18:48:29] Warning: Not using cache on failed catalog [18:48:29] Error: Could not retrieve catalog; skipping run [18:51:21] https://github.com/wikimedia/puppet/commit/a25081633a5573744ab33503527f52e52b4d7856 [18:52:18] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:59:37] (03CR) 10Hashar: [C: 031] "Scheduled for European SWAT on April 10th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: 10Sfic) [19:01:50] (03PS11) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [19:01:53] (03Abandoned) 10Paladox: Gerrit: Add a new polyGerritBaseUrl config [puppet] - 10https://gerrit.wikimedia.org/r/343736 (owner: 10Paladox) [19:02:48] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:04:58] RECOVERY - Check Varnish expiry mailbox lag on cp3046 is OK: OK: expiry mailbox lag is 0 [19:05:11] (03CR) 10Paladox: "Hi, this is causing puppet failures on gerrit-test3 and phabricator (labs instance). Could this be reverted or fixed please?" [puppet] - 10https://gerrit.wikimedia.org/r/346732 (owner: 10Alexandros Kosiaris) [19:07:17] (03PS1) 10ArielGlenn: fix bug that produced badly named page range files [dumps] - 10https://gerrit.wikimedia.org/r/347182 [19:10:08] RECOVERY - Check Varnish expiry mailbox lag on cp3035 is OK: OK: expiry mailbox lag is 0 [19:14:28] RECOVERY - Check Varnish expiry mailbox lag on cp3049 is OK: OK: expiry mailbox lag is 0 [19:16:58] RECOVERY - Check Varnish expiry mailbox lag on cp3047 is OK: OK: expiry mailbox lag is 0 [19:22:48] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:48] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:40:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:45:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:50:48] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:52:46] (03CR) 10Alexandros Kosiaris: "Reverting is not really possible. This is just the first in a long series of (already merged) commits in a direction we do want to go." [puppet] - 10https://gerrit.wikimedia.org/r/346732 (owner: 10Alexandros Kosiaris) [19:57:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:59:18] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:01:53] (03PS1) 10Alexandros Kosiaris: Remove role::backup::config [puppet] - 10https://gerrit.wikimedia.org/r/347184 [20:02:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:08:38] RECOVERY - Check Varnish expiry mailbox lag on cp4005 is OK: OK: expiry mailbox lag is 0 [20:09:18] RECOVERY - Check Varnish expiry mailbox lag on cp4006 is OK: OK: expiry mailbox lag is 0 [20:22:48] PROBLEM - Disk space on cp1052 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=86%) [20:27:18] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:33:28] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:40:59] (03Draft1) 10Paladox: Phabricator: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347188 [20:41:02] (03PS2) 10Paladox: Phabricator: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347188 [20:42:58] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:44:48] RECOVERY - Check Varnish expiry mailbox lag on cp4007 is OK: OK: expiry mailbox lag is 0 [20:45:24] (03Draft1) 10Paladox: Gerrit: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347189 [20:45:30] (03PS2) 10Paladox: Gerrit: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347189 [20:55:48] RECOVERY - Disk space on cp1052 is OK: DISK OK [20:56:19] (03PS3) 10Paladox: Gerrit: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347189 [20:56:41] !log removed varnishkafka logs and daemon.log.1 on cp1052 to free disk space and clear alert [20:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:35] ACKNOWLEDGEMENT - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black T162099 [21:01:28] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [21:10:58] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:12:49] (03PS4) 10Paladox: Gerrit: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347189 [21:13:25] (03PS5) 10Paladox: Gerrit: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347189 [21:19:31] (03PS3) 10Paladox: Phabricator: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347188 [21:24:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:29:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:31:08] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:18] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
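On the "Could not find data item profile::backup::pool in any Hiera data file" failure reported above: the usual ways to unbreak a labs/Cloud VPS instance that pulls in a production profile expecting a Hiera key are either to give the profile parameter a sane default in the manifest (presumably the intent of the "Make backup optional" patches above) or to define the key in the instance/project Hiera via Horizon or wikitech. A minimal sketch of the latter, with a placeholder value, since the correct pool name for that environment is not shown here:

# Instance or prefix Hiera for phabricator.phabricator.eqiad.wmflabs (hypothetical)
profile::backup::pool: placeholder-pool   # replace with a bacula pool actually defined for the environment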
[21:31:58] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:32:08] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [21:51:08] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3166371 (10grin) @Nemo_bis uh, these servers are basically idle. Any SPF checking may be okay, fork or otherwise. Thanks for the links, I'll take to browse them. (I am not a puppet guy so that... [21:55:58] RECOVERY - Check Varnish expiry mailbox lag on cp4015 is OK: OK: expiry mailbox lag is 0 [21:56:58] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.82 seconds [21:59:18] PROBLEM - puppet last run on ms-be1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:06:58] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:13:58] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:28:18] RECOVERY - puppet last run on ms-be1036 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [22:33:08] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:40:58] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [23:02:08] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:37:18] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [23:38:18] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3074281 keys, up 16 days 7 hours - replication_delay is 57