[00:43:45] operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1856135 (Denniss) https://commons.wikimedia.org/wiki/File:Megabalanus_coccopoma.jpg Another file to run experiments with - L...
[01:27:43] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: puppet fail
[01:30:49] operations, DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1856148 (Halfak) @Krenair, it may also make sense to add a relevant row to the `logging` table.
[01:54:13] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[02:25:22] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 10m 04s)
[02:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:09:04] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: puppet fail
[03:20:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [5000000.0]
[03:24:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[03:35:52] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:04] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[03:36:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[03:36:43] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:44] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[03:57:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Dec 6 03:57:02 UTC 2015 (duration 1h 31m 41s)
[03:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:57:50] why is that taking so long
[04:00:04] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[04:00:39] lol what
[04:01:34] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:45] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/scap/files/l10nupdate-1;ace7bc642fe4815e62c80daab85fe47a3b2ff608$123-131
[04:02:04] why is it reimplementing foreachwiki?
[04:03:13] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:26:44] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:40:33] grrrit-wm: are you alive?
[04:40:59] no you aren't
[04:41:54] Krenair: the refreshMessageBlobs script is not fast but Sam knocked 2 hours off of it this week by cleaning up junk in the database. Probably no good reason it isn't using foreachwiki. Worth cleaning up I imagine
[04:42:13] look at SAL
[04:42:55] why did it take 3-4 hours yesterday?
[04:43:56] it's been taking that long for a while but the message wasn't showing hours before
[04:44:24] it pauses for slave lag after every row right now
[04:45:04] Sam made a patch to only pause every 100 rows but we missed the deploy window
[04:45:31] there were also hundreds of bogus lang rows that Sam cleaned out yesterday
[04:45:59] it's generally a slow script though for sure
[05:06:53] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: puppet fail
[05:33:55] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: / 526 MB (5% inode=80%)
[05:34:22] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[05:40:46] !log silver: apt-get clean for disk space
[05:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:40:53] mutante: thanks :)
[05:41:08] I think the issue is that nutcracker is running wild — can you make any sense of that?
[05:41:12] the log is huge
[05:41:28] np, got a page about it. was at 5% and 523M left. after this now 1.2G left and 87%
[05:41:44] RECOVERY - MariaDB disk space on silver is OK: DISK OK
[05:42:15] today’s nutcracker log is 1.1G
[05:42:15] and there was the recovery page
[05:42:18] that doesn’t seem right :)
[05:43:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:43:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[05:44:32] andrewbogott: sorry, don't know besides that nutcracker is doing stuff
[05:44:43] ok — me neither
[05:44:50] it’s vaguely possible that this is normal behavior
[05:44:54] is it really that much more than usual and didn't just run out of disk now
[05:44:59] before logrotate
[05:45:52] andrewbogott: let me gzip log.1
[05:46:13] thanks, that’s what I was thinking of trying as well
[05:46:17] might be that it compresses way down
[05:46:33] if that ends up in size like 2.gz to 7.gz then it's normal
[05:46:36] for that timeframe
[05:47:08] but then there’s this: https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&h=silver.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=network_report&c=Virtualization+cluster+eqiad
[05:47:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:47:33] hm, 96M?
[05:47:41] So I guess it was a normal size
[05:47:44] 96M -rw-r----- 1 nutcracker adm 96M Dec 5 06:26 nutcracker.log.1.gz
[05:47:47] 87M -rw-r----- 1 nutcracker adm 87M Dec 4 06:26 nutcracker.log.2.gz
[05:47:50] 82M -rw-r----- 1 nutcracker adm 82M Dec 3 06:26 nutcracker.log.3.gz
[05:47:52] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:48:13] mutante: ok, I’m convinced that nothing in particular is happening, at least with that logfile
[05:48:21] yea, also, this is just a 10G partition
[05:48:25] weird spikes on that network graph but maybe that’s normal
[05:48:25] and not a separate one for /var
[05:48:27] so..
[05:48:36] that's just easy to fill up in general
[05:48:40] yeah, should figure out about getting those logs someplace else
[05:48:59] for now it's fine. 25% free
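A minimal sketch of the size check run in the exchange above: compress the freshly rotated log (the command actually !logged at 05:50) and compare it against the older rotations to see whether the day's volume is anomalous. Paths come from the conversation; the listing step is illustrative.

    # Compress the just-rotated nutcracker log on silver.
    gzip /var/log/nutcracker.log.1
    # Compare against earlier rotations; if .1.gz lands in the same size
    # range as .2.gz through .7.gz, the day's volume is normal.
    ls -lh /var/log/nutcracker.log.*.gz

The 96M/87M/82M listing pasted at 05:47 is exactly this comparison: the new log compressed to roughly the same size as the previous days', so nothing unusual was happening.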
[05:49:00] But this is the second disk space alert this week, after my never having seen one before
[05:49:08] the rest can be later
[05:49:14] yep
[05:49:20] log to a different place somehow
[05:49:23] definitely good enough for a Saturday night
[05:49:26] yes
[05:49:39] i'll afk again then
[05:50:00] me too. g’night!
[05:50:04] !log silver gzip /var/log/nutcracker.log.1
[05:50:06] good night
[05:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:09:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0]
[06:24:52] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[06:30:04] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:03] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:34] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:34] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:03] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:35:22] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:44] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:12] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:56:34] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:56:34] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:57:53] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:53] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:12] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[07:05:12] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[07:10:12] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[07:15:12] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 200 seconds ago with 0 failures
[07:57:43] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:00:32] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[08:19:01] hmm, another report of thumbs not getting purged
[08:21:53] PROBLEM - DPKG on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:21:53] PROBLEM - Disk space on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:22:13] PROBLEM - salt-minion processes on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:22:14] PROBLEM - grafana-admin.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:25] PROBLEM - RAID on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
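An aside on the unpurged-thumbnail reports above (T119038 at 00:43 and the 08:19 remark): a hedged sketch of forcing a re-render by purging the file page through the MediaWiki API. The file name is the one from the task; action=purge has to be sent as a POST, which the curl data options produce.

    # Ask MediaWiki to purge the file description page and its cached
    # renderings (thumbnails are re-generated on the next request).
    curl -s 'https://commons.wikimedia.org/w/api.php' \
      --data-urlencode 'action=purge' \
      --data-urlencode 'titles=File:Megabalanus_coccopoma.jpg' \
      --data-urlencode 'format=json'

If the old thumbnail still serves after this, the problem is in the cache layer in front of MediaWiki rather than in the purge itself, which is what T119038 is tracking.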
[08:22:33] PROBLEM - configured eth on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:22:43] PROBLEM - dhclient process on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:23:04] PROBLEM - grafana.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:13] PROBLEM - puppet last run on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:23:14] PROBLEM - Check size of conntrack table on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:29:40] thedj: in ams?
[08:30:34] yeah
[09:16:15] (PS4) Yuvipanda: dynamicproxy: Increase websocket timeout [puppet] - https://gerrit.wikimedia.org/r/256882 (https://phabricator.wikimedia.org/T120335)
[09:39:15] (CR) Yuvipanda: [C: 2] k8s: Roll etcd into master role [puppet] - https://gerrit.wikimedia.org/r/257173 (owner: Yuvipanda)
[09:41:12] PROBLEM - NTP on krypton is CRITICAL: NTP CRITICAL: No response from NTP server
[10:00:13] (PS1) Yuvipanda: k8s: Use regular puppet cert path [puppet] - https://gerrit.wikimedia.org/r/257176
[10:01:03] (PS2) Yuvipanda: k8s: Use regular puppet cert path [puppet] - https://gerrit.wikimedia.org/r/257176
[10:01:05] (PS6) Yuvipanda: base: Allow auto puppetmaster switching tuning [puppet] - https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159)
[10:03:15] (CR) Yuvipanda: [C: 2] k8s: Use regular puppet cert path [puppet] - https://gerrit.wikimedia.org/r/257176 (owner: Yuvipanda)
[10:08:52] PROBLEM - SSH on krypton is CRITICAL: Server answer
[10:12:43] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[10:18:04] (PS5) Yuvipanda: dynamicproxy: Increase websocket timeout [puppet] - https://gerrit.wikimedia.org/r/256882 (https://phabricator.wikimedia.org/T120335)
[10:18:55] (Abandoned) Yuvipanda: [WIP] / HACK: Enforce single ssldir for puppet [puppet] - https://gerrit.wikimedia.org/r/256642 (owner: Yuvipanda)
[10:19:13] (Abandoned) Yuvipanda: puppetmaster: Make sure base::puppet is present [puppet] - https://gerrit.wikimedia.org/r/255154 (owner: Yuvipanda)
[10:19:44] (CR) Yuvipanda: [C: 2] dynamicproxy: Increase websocket timeout [puppet] - https://gerrit.wikimedia.org/r/256882 (https://phabricator.wikimedia.org/T120335) (owner: Yuvipanda)
[10:22:23] PROBLEM - SSH on krypton is CRITICAL: Server answer
[10:24:04] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:26:13] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[10:26:53] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:47:44] PROBLEM - SSH on krypton is CRITICAL: Server answer
[10:49:43] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[10:57:33] PROBLEM - SSH on krypton is CRITICAL: Server answer
[11:15:04] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[11:26:43] PROBLEM - SSH on krypton is CRITICAL: Server answer
[11:28:34] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[11:38:33] PROBLEM - SSH on krypton is CRITICAL: Server answer
[11:42:23] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[11:50:03] PROBLEM - SSH on krypton is CRITICAL: Server answer
[11:52:03] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[11:57:53] PROBLEM - SSH on krypton is CRITICAL: Server answer
[12:03:53] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[12:11:42] PROBLEM - SSH on krypton is CRITICAL: Server answer
[12:17:23] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[12:23:13] PROBLEM - SSH on krypton is CRITICAL: Server answer
[12:27:04] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[12:38:54] PROBLEM - SSH on krypton is CRITICAL: Server answer
[12:44:44] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[12:58:14] PROBLEM - SSH on krypton is CRITICAL: Server answer
[13:00:23] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[13:06:13] PROBLEM - SSH on krypton is CRITICAL: Server answer
[13:10:03] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[13:17:52] PROBLEM - SSH on krypton is CRITICAL: Server answer
[13:19:43] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[13:23:32] (CR) Paladox: Gerrit: use Diffusion for repo browsing (again) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: Chad)
[13:25:33] PROBLEM - SSH on krypton is CRITICAL: Server answer
[13:27:32] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[13:39:23] PROBLEM - SSH on krypton is CRITICAL: Server answer
[13:41:23] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[13:48:06] (PS1) Paladox: Fix redirections in gerrit [puppet] - https://gerrit.wikimedia.org/r/257193
[13:49:03] PROBLEM - SSH on krypton is CRITICAL: Server answer
[13:52:59] (CR) Krinkle: Fix redirections in gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox)
[13:54:12] (CR) Paladox: Fix redirections in gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox)
[13:54:52] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[13:57:18] (CR) Krinkle: Fix redirections in gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox)
[13:58:29] (CR) Paladox: Fix redirections in gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox)
[14:00:22] (CR) Paladox: Fix redirections in gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox)
[14:00:28] (PS2) Paladox: Fix redirections in gerrit [puppet] - https://gerrit.wikimedia.org/r/257193
[14:02:44] PROBLEM - SSH on krypton is CRITICAL: Server answer
[14:08:33] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[14:24:04] PROBLEM - SSH on krypton is CRITICAL: Server answer
[14:37:54] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[14:43:43] PROBLEM - SSH on krypton is CRITICAL: Server answer
[15:01:43] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /var 69958 MB (3% inode=99%)
[15:14:52] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[15:20:42] PROBLEM - SSH on krypton is CRITICAL: Server answer
[15:26:24] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[15:32:32] PROBLEM - SSH on krypton is CRITICAL: Server answer
[16:03:43] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
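The restbase1008 /var alert above is the second low-disk warning of the day after silver. A generic first-response sketch, not what was actually run here: before deciding what to clean or relocate, see which directory on the alerting filesystem is consuming the space.

    # Largest first-level consumers of /var only; -x keeps du from
    # crossing into other mounted filesystems.
    du -xh --max-depth=1 /var 2>/dev/null | sort -hr | head

On a RESTBase host the bulk is typically Cassandra data under /var/lib, so a large result there is expected rather than a leak.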
[16:09:32] PROBLEM - SSH on krypton is CRITICAL: Server answer
[16:11:24] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[16:17:14] PROBLEM - SSH on krypton is CRITICAL: Server answer
[16:25:24] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /var 70146 MB (3% inode=99%)
[17:14:12] RECOVERY - Disk space on restbase1008 is OK: DISK OK
[17:36:26] hi, there is some dead-lock in DB
[17:36:38] https://he.wikipedia.org/w/index.php?title=%D7%94%D7%A4%D7%95%D7%A2%D7%9C_%D7%97%D7%93%D7%A8%D7%94&action=purge
[17:39:43] operations: Dead-lock in hewiki DB following page move - https://phabricator.wikimedia.org/T120571#1856635 (eranroz) NEW
[17:42:47] operations: Dead-lock in hewiki DB following page move - https://phabricator.wikimedia.org/T120571#1856643 (IKhitron) Let me guess that we have an infinite loop there - a redirect page to itself.
[17:44:52] operations: Dead-lock in hewiki DB following page move - https://phabricator.wikimedia.org/T120571#1856644 (eranroz) Function: WikiPage::insertRedirectEntry Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.22)
[18:01:54] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail
[18:07:38] (CR) Paladox: "Hi, I just tested this on my local gerrit install; the links show correctly but when clicked still don't redirect properly but the links do " [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox)
[18:29:04] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:34:23] (PS1) BBlack: pybal: whitespace-only: align => [puppet] - https://gerrit.wikimedia.org/r/257207
[18:34:25] (PS1) BBlack: pybal: persist journal logs to disk [puppet] - https://gerrit.wikimedia.org/r/257208
[18:34:51] (CR) BBlack: [C: 2 V: 2] pybal: whitespace-only: align => [puppet] - https://gerrit.wikimedia.org/r/257207 (owner: BBlack)
[18:37:04] (PS2) BBlack: pybal: persist journal logs to disk [puppet] - https://gerrit.wikimedia.org/r/257208
[18:49:39] !log reset auth token for User:QuimGil
[18:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:41:58] (PS3) Paladox: Fix redirections in gerrit [puppet] - https://gerrit.wikimedia.org/r/257193
[19:43:18] (PS4) Paladox: Fix redirections in gerrit [puppet] - https://gerrit.wikimedia.org/r/257193
[19:44:45] (PS5) Paladox: Fix redirections in gerrit [puppet] - https://gerrit.wikimedia.org/r/257193
[19:53:02] (CR) Paladox: "branch seems to show like refs/head/master instead of just being master." [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox)
[21:48:27] !log krypton unresponsive, nothing on console. shutting down, increasing instance ram from 2 to 4g, and rebooting.
[21:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
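An aside on the hewiki error filed earlier (17:44, T120571): error 1205 is a lock wait timeout, not a detected deadlock, so the usual triage is to find which transaction holds the row locks that WikiPage::insertRedirectEntry is waiting on. A sketch using standard InnoDB introspection on MariaDB/MySQL of that era; the host is the one quoted in the task comment, and credentials are assumed to come from the environment.

    # Pair up blocked and blocking transactions on the affected DB host.
    mysql -h 10.64.16.22 -e "
      SELECT r.trx_id AS waiting_trx,  r.trx_query AS waiting_query,
             b.trx_id AS blocking_trx, b.trx_query AS blocking_query
      FROM information_schema.INNODB_LOCK_WAITS w
      JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
      JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;"
    # The TRANSACTIONS section of the engine status report carries the
    # same information plus the individual lock details.
    mysql -h 10.64.16.22 -e "SHOW ENGINE INNODB STATUS\G"

If the blocking transaction is the page-move transaction itself, that would fit IKhitron's self-redirect guess: the move holds the page row while the redirect insert waits on it.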
[21:50:32] RECOVERY - SSH on krypton is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[21:50:53] RECOVERY - grafana-admin.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 534 bytes in 0.005 second response time
[21:50:53] RECOVERY - salt-minion processes on krypton is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:51:13] RECOVERY - dhclient process on krypton is OK: PROCS OK: 0 processes with command name dhclient
[21:51:13] RECOVERY - configured eth on krypton is OK: OK - interfaces up
[21:51:13] RECOVERY - RAID on krypton is OK: OK: no RAID installed
[21:51:43] RECOVERY - grafana.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.006 second response time
[21:51:43] RECOVERY - DPKG on krypton is OK: All packages OK
[21:51:43] RECOVERY - Disk space on krypton is OK: DISK OK
[21:51:43] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[21:51:44] RECOVERY - Check size of conntrack table on krypton is OK: OK: nf_conntrack is 0 % full
[22:10:03] RECOVERY - NTP on krypton is OK: NTP OK: Offset -0.001266598701 secs
[22:20:12] operations, Deployment-Systems: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1856865 (bd808) NEW
[22:21:13] operations, Deployment-Systems, Wikimedia-General-or-Unknown, Patch-For-Review, User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1856874 (bd808) Open>Resolved The nightly l10nupdate cron job seems to be worki...
[22:29:13] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[22:51:20] (PS1) BryanDavis: vagrant: Set umask 0002 for wikidev users [puppet] - https://gerrit.wikimedia.org/r/257263 (https://phabricator.wikimedia.org/T120472)
[22:55:48] bd808: ^ let me know if / when you want me to look at / merge
[22:56:15] whenever you have time
[22:56:35] I was going to put it up for puppet swat but that's not until Thursday apparently
[22:56:50] (CR) Yuvipanda: [C: 2] vagrant: Set umask 0002 for wikidev users [puppet] - https://gerrit.wikimedia.org/r/257263 (https://phabricator.wikimedia.org/T120472) (owner: BryanDavis)
[22:57:05] bd808: yeah we cancelled tuesday's because there's a big ldap migration planned
[22:57:17] *nod*
[22:57:51] bd808: done
[22:58:22] thanks
[22:59:23] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:13:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[23:20:43] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[23:34:58] (CR) Alex Monk: [C: -1] "Should be merged with I20b202be" [puppet] - https://gerrit.wikimedia.org/r/256437 (https://phabricator.wikimedia.org/T115965) (owner: Reedy)
[23:35:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[23:41:25] (PS1) Alex Monk: Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/257267
[23:41:28] (PS1) Alex Monk: Don't reimplement foreachwiki in l10nupdate-1 [puppet] - https://gerrit.wikimedia.org/r/257268
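Krenair's last patch above answers the 04:02 question: the l10nupdate-1 script iterated the wiki list itself instead of using the standard wrapper. A sketch of the shape of that change, with the script path and dblist location used for illustration only (the real loop at l10nupdate-1 lines 123-131 differs in detail):

    # Before: a hand-rolled loop over every wiki in the farm.
    for wiki in $(</srv/mediawiki/dblists/all.dblist); do
        mwscript maintenance/refreshMessageBlobs.php --wiki="$wiki"
    done
    # After: foreachwiki already knows the wiki list and handles the
    # per-wiki invocation, so the script stops duplicating that logic.
    foreachwiki maintenance/refreshMessageBlobs.php

This is orthogonal to the speed fix discussed at 04:44-04:45 (pausing for slave lag every 100 rows instead of every row), which lives inside the PHP script rather than in this shell loop.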
[23:49:23] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: puppet fail
[23:49:54] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: puppet fail
[23:51:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]