[00:14:05] (03CR) 10Alex Monk: Optionally filter private wiki results in mwgrep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: 10Reedy)
[01:01:25] (03CR) 10Alex Monk: "Ah. Didn't show up on my list because I filter with -label:Verified-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303383 (owner: 10Reedy)
[02:19:36] RECOVERY - MegaRAID on db1065 is OK: OK: optimal, 1 logical, 2 physical
[02:22:36] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 08m 59s)
[02:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Oct 9 02:27:12 UTC 2016 (duration 4m 36s)
[02:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:08] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[02:34:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:58:35] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:06:45] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:11:28] PROBLEM - puppet last run on db1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:38:09] RECOVERY - puppet last run on db1085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:52:36] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2702163 (10Krenair)
[04:17:37] PROBLEM - Disk space on dubnium is CRITICAL: DISK CRITICAL - free space: / 656 MB (3% inode=94%)
[05:38:47] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:02:49] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:11:12] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:35:19] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:29] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:01:50] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:05:49] 06Operations: Create mass message sender user group on Turkish Wikipedia - https://phabricator.wikimedia.org/T147740#2702223 (10Zppix)
[07:19:37] Wait, wat?
[07:19:58] “User rights management” is now “Change user groups”...
[07:20:54] That took me way too long to figure out...
[07:24:14] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:28:35] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:11:01] PROBLEM - puppet last run on lvs1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:05] RECOVERY - puppet last run on lvs1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:34] mmmm cp2008 seems frozen, I can't even see anything on the mgmt console
[08:45:56] !log powercycling cp2008, no ssh and mgmt console frozen
[08:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:46:41] PROBLEM - MariaDB disk space on db1026 is CRITICAL: DISK CRITICAL - free space: /srv 87995 MB (5% inode=99%)
[08:47:05] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK
[08:47:15] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK
[08:47:25] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK
[08:47:34] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK
[08:47:35] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK
[08:47:35] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK
[08:47:43] * volans looking (db1026)
[08:47:45] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK
[08:47:47] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK
[08:47:48] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK
[08:47:48] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK
[08:47:49] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK
[08:47:55] RECOVERY - Host cp2008 is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms
[08:48:10] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK
[08:48:11] thanks
[08:48:11] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK
[08:48:44] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK
[08:48:44] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK
[08:48:45] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK
[08:48:45] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK
[08:48:45] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK
[08:48:45] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK
[08:48:45] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK
[08:48:46] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK
[08:48:46] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK
[08:48:47] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK
[08:48:47] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK
[08:48:48] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK
[08:48:54] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK
[08:48:54] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK
[08:48:54] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK
[08:48:54] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK
[08:48:58] what are all those?
[08:49:13] apergos: ipsec connections to cp2008
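For readers skimming the alert flood: the per-host IPsec checks count ESP tunnels to peer hosts, cp2008 among them, so one frozen box fans out into dozens of per-peer recoveries once it comes back. A rough way to cross-check those figures by hand, assuming shell access on one of the affected hosts (the actual Icinga probe is a separate plugin, so treat this as an illustrative sketch):

    # Summary line strongswan prints, e.g. "Security Associations (56 up, 0 connecting)"
    sudo ipsec status | grep 'Security Associations'
    # Roughly the "N ESP OK" figure from the checks: count installed child SAs
    sudo ipsec statusall | grep -c 'INSTALLED'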
[08:49:15] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK
[08:49:16] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK
[08:49:25] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK
[08:49:31] I am checking db1026
[08:49:38] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK
[08:49:38] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK
[08:49:40] jynus: back to 117GB
[08:49:48] I am online checking too
[08:49:53] no 152GB free
[08:49:55] elukey: oh :-D
[08:49:56] *now
[08:50:14] volans: what did you do?
[08:50:21] nothing
[08:50:24] df :-P
[08:50:26] There are really old dumps there, from 2015, can those be deleted if needed?
[08:50:32] apergos: yeah we don't have a good way to aggregate on them :(
[08:50:36] who knew a df command would free that kind of space :-P
[08:50:45] elukey: ok, good to know!
[08:50:56] I guess a long running query generating a tmp table then?
[08:51:00] 150GB is low for a database
[08:51:09] yep that's my theory too
[08:51:15] it cannot perform a schema change, for example
[08:51:24] jynus: there is 220G free in the pv if needed I just saw
[08:51:40] What about those dumps in /srv/dumps from 2015?
[08:52:08] there was a spike in connections
[08:52:31] checking for weekly patterns
[08:53:22] it's not in the dump role though
[08:53:31] it's only 55G. I mean it would be nice to get back anyways but
[08:53:44] it's in the watchlist, recentchanges, recentchangeslinked, contributions, logpager role
[08:54:03] volans: yeah, and those are pretty old 2015
[08:54:27] As apergos said it is only 55G, but hey, better than nothing :)
[08:54:32] And there are 220G free in the pv
[08:54:34] yes, but that will not help with the pattern
[08:54:40] no it won't
[08:54:42] RECOVERY - MariaDB disk space on db1026 is OK: DISK OK
[08:54:44] the main issue was the long-running sort queries
[08:55:06] in any case, db1026 is in the list for decommission
[08:55:19] how much space do the newer ones have?
[08:55:43] 1.5 now, 4TB the new ones
[08:55:47] jynus: There is no server serving watchlist, recentchanges and all that for s5 apart from db1026 looks like no?
[08:55:50] oh much better
[08:56:01] it doesn't seem like 1.xT is much room to maneuver
[08:56:08] apergos, yes
[08:56:23] there are 2 things pending, get rid of all <=db1050
[08:56:27] and compression
[08:56:35] both are WIP tickets
[08:56:41] do we know what kind of performance hit we'll get with compression?
[08:56:54] or really 'performance impact', I dunno if it will be a hit
[08:56:57] very minimal, given our mostly read-only workload
[08:56:58] apergos: It shouldn't have a big impact on performance no
[08:57:12] we have one of those slaves in testing
[08:57:18] oh? which one?
[08:57:25] or tell me how to figure it out :-)
[08:57:44] jynus: We can discuss tomorrow if we need to bump the priority of the compression tickets, to avoid that kind of issue. However, we do need another s5 server to replace this one as this is the only one serving rc for S5
[08:58:11] jynus: in the sorting graph on the grafana dashboard the sort rows has a large spike 8:24-8:29 but doesn't align with other spikes later for read_next handler and IOPS at 8:37, unless it has done stuff for 8 minutes without sorting and doing IO
[08:58:39] when I connected I saw a bunch of long-running queries
[08:58:52] which the watchdog take care of after some time
[08:58:58] what could it do for 8 minutes only accessing memory... skeptical
[08:58:59] *takes
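The working theory at this point is a long-running sort spilling an implicit temporary table onto /srv, with InnoDB compression as the pending mitigation. A minimal sketch of how one might check both angles from the MariaDB side, assuming local root client access on db1026 (the queries are illustrative, not the actual runbook):

    # How often sorts and implicit temporary tables are spilling to disk
    sudo mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables'; SHOW GLOBAL STATUS LIKE 'Sort_merge_passes'"
    # Largest tables, and whether they are already compressed (row_format)
    sudo mysql -e "SELECT table_schema, table_name, row_format,
        ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
        FROM information_schema.tables
        ORDER BY data_length + index_length DESC LIMIT 10"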
[08:59:47] let's get rid of the dumps and log it to minimize issues and then a plan for long term
[08:59:55] agreed
[09:00:10] hey
[09:00:19] morning
[09:00:20] marostegui, I delete those?
[09:00:32] jynus: Sure, go ahead
[09:01:22] I have to be somewhere now (I'm late) but I'll check back in later. three present/former dbs on it seems pretty good to me :-)
[09:01:26] *dbas
[09:01:26] !log dropping unneeded files on db1026 to mitigate disk issues for the next week
[09:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:02:09] At my previous job we used to have a cronjob reporting and logging every minute the state of the disk, the size of the mysqldir on disk and the show full processlist, and keeping those text files for a week
[09:02:10] we usually have an alert at 10%
[09:02:13] it was useful for cases like this
[09:02:28] to investigate
[09:02:30] I have to check
[09:03:11] jynus: We can also try to optimize some tables to reclaim more space if needed during the week
[09:03:15] Just an idea
[09:03:19] marostegui, no
[09:03:28] we need to get rid of those servers
[09:03:36] That too :)
[09:03:46] [2016-10-09 08:49:16] SERVICE ALERT:
[09:03:52] so we got an alarm
[09:03:58] but it grew so quickly
[09:04:02] * akosiaris sees issues are under control, goes back to his cave
[09:04:06] XDD
[09:04:14] that it went to a page in a few seconds
[09:04:25] so I think it is a good thing, the alerting
[09:04:32] Yeah
[09:04:34] if it had continued growing
[09:04:35] It is :)
[09:04:45] at the same pace it would have done a denial of service
[09:04:52] Scary
[09:04:53] I think the alerting was ok
[09:05:13] Well, with the extra 55G we are safe if there are some more temporary tables coming in as a result of a sort :)
[09:05:18] marostegui, we have a job monitoring the disk every minute
[09:05:27] in fact right now we have 2 of those
[09:05:38] graphite
[09:05:41] and prometheus
[09:05:49] and they log outside
[09:05:50] haha
[09:05:57] it took ~4m to get from 186GB free down to 61
[09:06:04] https://grafana.wikimedia.org/dashboard/db/server-board?panelId=17&fullscreen&from=1475917560240&to=1476003720240&var-server=db1026&var-network=eth0
[09:06:15] marostegui, ^
[09:06:27] that is the 5 minute one
[09:06:42] :/
[09:07:05] !log chmod o+r /var/lib/varnish/frontend/_.vsm and /var/lib/varnish/cp2008/_.vsm on cp2008 to avoid gmond errors
[09:07:10] CPU usage started at 8:36
[09:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:07:38] and the 1-minute stats: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=12&fullscreen&from=1475993243468&to=1476004043468&var-server=db1026%3A9100&var-datasource=eqiad%20prometheus%2Fops
[09:07:41] Pretty scary query
[09:08:18] load reached 20, cpu 1009%
[09:08:28] So it was a one-off query? As there is no pattern there for the rest of the days, no?
[09:08:32] no
[09:08:40] no and yes
[09:08:43] let me clarify
[09:08:43] XD
[09:09:17] it was that one on top: https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1026&user=wikiuser&schema=wik&qmode=eq&query=&hours=1
[09:10:03] yes I was looking at the explain of that one a few minutes ago
[09:10:46] but the times are all after 8:36... not sure if it's the cause
[09:10:48] or the effect
[09:11:06] it is the cause
[09:11:12] Yeah, I think so too
[09:11:16] It is quite a big one
[09:11:31] because I saw it creating tmp tables still when I connected
[09:12:01] which is strange, because usually that is something that happens with api queries
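For the record, a sketch of the per-minute state dump marostegui describes above (09:02); the paths, script name, retention and datadir location are assumptions for illustration rather than an existing WMF cron:

    #!/bin/bash
    # Snapshot disk usage, datadir size and the full processlist once a minute,
    # keeping a week of files so incidents like this one can be reconstructed later.
    outdir=/var/log/db-state
    mkdir -p "$outdir"
    ts=$(date +%Y%m%dT%H%M%S)
    {
        df -h /srv
        du -sh /srv/sqldata          # assumed datadir path
        mysql -e "SHOW FULL PROCESSLIST"
    } > "$outdir/$ts.txt" 2>&1
    find "$outdir" -name '*.txt' -mtime +7 -delete

    # illustrative crontab entry:
    # * * * * * /usr/local/bin/db-state-snapshot.sh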
[09:12:54] cp2008 seems ok from what I can see, going afk :)
[09:14:20] well, I would say doing nothing aside from that delete, because this was a combination of things: query execution, table design, watchdog, disk space, old hardware; and in reality, it only caused some slowdown of the recentchanges views
[09:14:40] IT WASNT ME
[09:15:13] (03PS1) 10MarcoAurelio: Create 'massmessage-sender' group for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314850 (https://phabricator.wikimedia.org/T147740)
[09:15:14] :-)
[09:15:28] :)
[09:15:58] that and m*rk are really bad nicks on this channel
[09:17:08] The trade-off is at least humorous for me
[09:17:31] * volans gotta go too, ping me if needed
[09:18:36] https://en.wikipedia.org/wiki/Slowdown_Virginia is the genesis. I <3 my pings here.
[09:19:11] BTW, marostegui (when you go back) a single rc server is not a SPOF, if a server is not available or lagged, it jumps to a main load server
[09:19:49] AFAIK, only commons and enwiki cause issues, although for obvious reasons, that is not the ideal state
[09:23:51] jynus: Your AFAIK is correct
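To make jynus's point concrete: the watchlist/recentchanges query groups are defined in the load-balancer config, and when no host in a group is usable MediaWiki falls back to the section's main load, so db1026 being the only s5 host in those groups degrades rather than breaks those views. A hedged way to see which hosts carry the groups, assuming a checkout of operations/mediawiki-config (the file layout is from memory, so verify locally):

    # Hosts listed for the special query groups, and every mention of db1026
    grep -n -E "watchlist|recentchanges|recentchangeslinked|contributions|logpager" wmf-config/db-eqiad.php
    grep -n 'db1026' wmf-config/db-eqiad.php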
[09:29:09] (03PS1) 10MarcoAurelio: Fix 'massmessage-sender' group for ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314851 (https://phabricator.wikimedia.org/T147743)
[09:39:53] (03PS1) 10MarcoAurelio: Send abusefilter hit notifications from es.wikibooks to UDP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314852 (https://phabricator.wikimedia.org/T147744)
[09:50:28] 06Operations, 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05Security: db1026 almost run out of space due to ongoing query activity - https://phabricator.wikimedia.org/T147747#2702346 (10jcrespo)
[10:56:26] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[10:58:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 751 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3051263 keys - replication_delay is 751
[11:17:43] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:27:28] (03CR) 10Reedy: Optionally filter private wiki results in mwgrep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: 10Reedy)
[11:41:55] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:45:06] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3239
[11:50:05] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1126877 Threads: 1 Questions: 182695925 Slow queries: 6167 Opens: 10187 Flush tables: 2 Open tables: 532 Queries per second avg: 162.125 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[12:16:57] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:18:46] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3037280 keys - replication_delay is 0
[12:43:49] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:54:16] (03CR) 10Luke081515: [C: 031] Create 'massmessage-sender' group for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314850 (https://phabricator.wikimedia.org/T147740) (owner: 10MarcoAurelio)
[13:27:36] jynus: Ah, ok ok, didn't know that - but good to know :)
[15:08:17] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 605 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3040796 keys - replication_delay is 605
[15:10:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3038966 keys - replication_delay is 0
[15:40:25] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:07:09] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:55:02] (03CR) 10MarcoAurelio: [C: 031] Raise abuse filter emergency threshold for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314797 (https://phabricator.wikimedia.org/T145765) (owner: 10Dereckson)
[17:55:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 679 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3048861 keys - replication_delay is 679
[18:17:28] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis]
[18:18:47] jesus christ these errors
[18:19:59] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis]
[18:21:06] Do you really need to comment on all of them?
[18:25:21] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis]
[18:29:39] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2702752 (10MarcoAurelio) Since the wiki has been created and is now live, is this resolved?
[18:30:39] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2702753 (10Platonides) @Dereckson I'm not sure that's the best possible subdomain, but if after discussing it they still think that's th...
[18:30:44] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis]
[18:36:14] PROBLEM - puppet last run on maps2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis]
[18:55:09] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:19:11] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:31:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[19:37:12] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[19:58:11] (03CR) 10Alex Monk: Optionally filter private wiki results in mwgrep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: 10Reedy)
[20:01:32] (03CR) 10Reedy: Optionally filter private wiki results in mwgrep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: 10Reedy)
[20:17:42] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2702826 (10Dereckson) @BBlack Anything tagged Domains go to Traffic.
[20:17:52] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2702829 (10Dereckson) a:03Dereckson
[20:20:27] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558051 (10Zppix) I am no means probably even supposed to comment on this task but in my opinion i think it should be called projgrant or some other variant of that.
[20:41:16] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:05:17] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[21:17:17] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2702854 (10Dereckson) a:05Dereckson>03None
[21:19:39] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2702855 (10Dereckson) {V11}
[21:32:46] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2702864 (10Dereckson) Adding the current members of the Project Grants committee so we can get more input.
[21:36:53] robh: In a few months, we can order a 2yr birthday cake for https://phabricator.wikimedia.org/T86541
[21:37:17] (03PS2) 10Addshore: Enable simple-json-datasource on prod Grafana [puppet] - 10https://gerrit.wikimedia.org/r/314029 (https://phabricator.wikimedia.org/T147329)
[22:40:21] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:57:18] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3034102 keys - replication_delay is 0
[23:06:55] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:15:06] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:20:27] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:39:06] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[23:44:26] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures