[00:16:36] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf]
[00:23:31] (PS2) Eevans: Extend classpath via Puppet [puppet] - https://gerrit.wikimedia.org/r/313619 (https://phabricator.wikimedia.org/T133395)
[00:41:57] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:49:21] PROBLEM - MegaRAID on eventlog1001 is CRITICAL: NRPE: Unable to read output
[00:51:52] RECOVERY - MegaRAID on eventlog1001 is OK: OK: no disks configured for RAID
[01:09:42] Does the beta cluster have any extra debugging features not available on prod? As in, if I was able to reproduce a bug on production and on the beta cluster but not locally, does the beta cluster get me anywhere special?
[01:14:09] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:29:56] andrewbogott: ^ ?
[01:47:14] I'm not sure why you'd ping Andrew about a beta cluster question
[01:47:32] If you can't reproduce the bug locally but you can on beta, you can log into the beta servers and play around with the code to try to find out what's going on
[01:47:38] can't just do that in prod
[01:48:54] you'd want deployment-tin.deployment-prep.eqiad.wmflabs, and to `sudo -u jenkins-deploy -s` to be able to write files and run scap sync-file and things
[01:51:21] Krenair: ahh cool! Mmm it sez he works with labs
[01:51:35] yeah
[01:51:40] but this isn't labs infrastructure
[01:51:58] mmmmm right
[01:52:07] beta just lives inside labs vms
[01:52:55] Krenair: is it OK to just do that randomly to the beta cluster?
I guess I could put some extra logging code in
[01:53:04] I wonder if I have permissionz
[01:53:13] yeah
[01:53:16] as far as I'm concerned
[01:55:22] most wikimedia developers have stopped for the night and even if they hadn't, it's not the end of the world if you accidentally take beta down while using it to debug stuff
[01:58:02] Mmmm k
[01:58:10] I guess there's no xdebugging or anything
[02:01:53] Krenair: thx!
[02:09:27] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:27:25] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 12m 51s)
[02:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:02] AndyRussG: no xdebug, but some people could allegedly intercept calls with the hhvm debugger
[02:33:39] tgr: hmm! on prod or betacluster?
[02:33:56] beta
[02:34:05] don't know the details though
[02:34:22] heard it from matt_flaschen IIRC
[02:34:35] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[02:35:08] tgr: cool, thx!! Yeah also I saw https://wikitech.wikimedia.org/wiki/Debugging_in_production
[02:36:14] X-Wikimedia-Debug is very useful but it's not debugging in the sense of halting evaluation of the code and inspecting things
[02:36:24] (i.e. an interactive debugger, like xdebug)
[02:36:45] as I understand the built-in hhvm debugger can do that
[02:37:31] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:39:01] AndyRussG, are you familiar with mwrepl? Is it enough to do PHP method calls with a debugger, or do you need to debug through a web request?
[02:39:19] TestingAccessWrapper can also be useful to bypass accessibility (protected/private) when testing.
[02:41:34] As tgr said, if you need to debug through a web request, I think it is possible (at least in Beta), but I don't have the details handy. ebernhardson knows it properly (he's the one that set it up).
[02:45:16] I did write up details on that page on debugging the action API, if that's helpful (which Brad then improved): https://wikitech.wikimedia.org/wiki/Debugging_in_production#Debugging_action_API_requests_in_shell
[02:45:36] Okay, heading out.
[02:53:03] matt_flaschen: thx!!!
[02:53:19] yeah I'd really like to debug thru a web request
[02:53:52] or at least add in some temporary debug code (on beta) at some key points
[03:00:22] tho actually mwrepl might also shed some light...
[03:00:30] cya thx again all!
[03:02:20] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:39:49] !gs ALVARO MOLINA VALE BASURAAAA.. LALA
[05:52:16] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:57:41] Operations, Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2681457 (ArielGlenn) ariel@stat1002:~$ groups melodykramer melodykramer : wikidev statistics-privatedata-users ariel@stat1002:~$ ls -l /etc/mysql/conf.d total 20 -...
[06:16:49] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:27:44] Operations: sftp gives bogus "Couldn't stat remote file: No such file or directory" - https://phabricator.wikimedia.org/T146509#2663619 (Dzahn) https://www.freebsd.org/cgi/man.cgi?query=sftp&sektion=1 put [-afPpr] local-path [remote-path] Upload local-path and store it on the remote machine....
[06:43:11] Operations, Traffic, media-storage, Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2681470 (Aklapper) Is anybody actively investigating this? / Does this need more investigation? Or did the merged patches eliminate the issue?
[06:58:31] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ldap-utils]
[07:22:55] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:42:04] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf]
[07:49:26] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:51:01] (CR) 01tonythomas: "Looks like I need the same for meta as well :-(" [mediawiki-config] - https://gerrit.wikimedia.org/r/309015 (owner: Alex Monk)
[07:56:35] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[08:03:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:04:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:06:46] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[08:11:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:16:24] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:23:35] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:24:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:41:19] ACKNOWLEDGEMENT - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Gehel reimage required - T146848
[08:41:19] ACKNOWLEDGEMENT - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Gehel reimage required - T146848
[08:47:08] (PS1) 01tonythomas: Lift IP throttling for Amrita University in meta wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/313648
[08:51:47] hey. can someone review https://gerrit.wikimedia.org/r/#/c/313648/ and save our hackathon ?
:D
[08:57:59] (PS1) Alexandros Kosiaris: Rework network::subnets [puppet] - https://gerrit.wikimedia.org/r/313650
[09:10:14] (PS1) Alexandros Kosiaris: Add replication_pass for eqiad maps servers [labs/private] - https://gerrit.wikimedia.org/r/313651
[09:13:12] (CR) Alexandros Kosiaris: [C: 2 V: 2] Add replication_pass for eqiad maps servers [labs/private] - https://gerrit.wikimedia.org/r/313651 (owner: Alexandros Kosiaris)
[09:35:42] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 231 bytes in 0.329 second response time
[09:40:47] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[09:44:44] got the page for tools checker, Caught exception: {'desc': 'Invalid credentials'}
[09:46:42] I need to check out of the room now, will check back later but I'm not sure where to start looking
[09:47:00] chasemp andrewbogott madhuvishy ^
[09:48:22] Pinging yuvi
[09:48:55] Thanks godog. Andrew and I are in airport boarding soon, chase is in flight
[09:49:15] Operations, Puppet, Labs, Labs-Infrastructure: puppet-enc failure does not produce stderr/stdout printed - https://phabricator.wikimedia.org/T147111#2681529 (hashar)
[09:49:49] Had a look as well, not sure what to debug though
[09:50:34] labs puppetmasters are all broken due to puppet-enc returning 1. But that is transient
[09:52:58] yuvi should be online
[09:53:00] i'm looking too
[09:53:07] got a few minutes before flight
[09:55:08] I don't think it is much of an emergency.
We can probably survive with a few hours of no puppet :)
[09:55:44] ldap3.core.exceptions.LDAPBindError: automatic bind not successful - invalidCredentials
[09:55:45] :D
[09:55:52] Operations, Puppet, Labs, Labs-Infrastructure: puppet-enc failure does not produce stderr/stdout printed - https://phabricator.wikimedia.org/T147111#2681541 (hashar) ``` root@deployment-puppetmaster:~# /usr/local/bin/puppet-enc deployment-db03.deployment-prep.eqiad.wmflabs classes: ['role::mariad...
[09:56:56] so same as what godog said earlier
[09:58:16] anyway
[09:58:40] I would assume one of the ldap servers has an issue of some sort
[09:59:13] Operations, Puppet, Labs, Labs-Infrastructure: puppet-enc failure does not produce stderr/stdout printed - https://phabricator.wikimedia.org/T147111#2681545 (hashar) The script uses `/etc/ldap.yaml` and is apparently run on the puppet master. It has the servers: - ldap-labs.eqiad.wikimedia.or...
[10:01:37] it looks like LDAP is kinda down
[10:01:53] Operations, Puppet, Labs, Labs-Infrastructure: puppet-enc failure does not produce stderr/stdout printed - https://phabricator.wikimedia.org/T147111#2681548 (hashar)
[10:03:48] serpens has some low disk space DISK WARNING - free space: / 852 MB (4% inode=94%):
[10:04:02] but been a warning for 1d and 8hours
[10:05:59] checking now too with akosiaris in the hotel lobby
[10:13:12] yuvipanda: so the enc is failing due to not reaching ldap?
[10:13:22] andrewbogott: pretty much
[10:13:30] yuvipanda: if ldap is down I shouldn't be able to log in to a labs instance should I?
[10:13:38] andrewbogott: indeed, which is why it's confusing
[10:13:39] sudo works
[10:13:49] so it's partly down :)
[10:13:58] Does the enc have failover or does it only use one of the servers?
[10:14:39] andrewbogott: it theoretically uses the failover
[10:14:46] :) ok
[10:15:13] I'm 5 minutes to boarding but anyway it sounds like it's just time for a slapd restart if moritzm_ isn't around to investigate
[10:15:17] Want me to just do that?
[10:15:34] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 1.18 seconds
[10:15:35] please not yet
[10:15:41] 'k
[10:15:59] _joe_: mark akosiaris I'm going to come down to the lobby now
[10:16:10] there isn't anything user facing rn, except for puppet failures
[10:16:11] BCN is a real mess today
[10:17:31] <_joe_> because the ENC is working again
[10:19:14] (PS1) Alexandros Kosiaris: Revert "Add replication_pass for eqiad maps servers" [labs/private] - https://gerrit.wikimedia.org/r/313657
[10:19:45] * andrewbogott withdraws from the fray
[10:19:50] have a nice trip!
[10:22:58] * apergos peeks in
[10:23:12] all this I missed from getting house ready for guest, ha!
[10:23:33] I will be around again in about an hour, then gone again in about 3 hours (but phone with me)
[10:31:19] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[10:32:27] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.591 second response time
[10:38:44] Operations, Traffic, media-storage, Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2681554 (DaveHMBA) The patch has worked for my draft article https://en.wikipedia.org/wiki/Draft:Flags_of_the_Imperial_Austrian_Army_of_the_Napoleonic_Wars ab...
[10:41:30] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[10:47:33] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[10:48:28] (Abandoned) Alexandros Kosiaris: Revert "Add replication_pass for eqiad maps servers" [labs/private] - https://gerrit.wikimedia.org/r/313657 (owner: Alexandros Kosiaris)
[10:52:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[10:56:04] Operations, Puppet, Labs, Labs-Infrastructure: puppet-enc failure does not produce stderr/stdout printed - https://phabricator.wikimedia.org/T147111#2681567 (hashar) Open>Resolved Got fixed by restarting some services. T147112 is the follow up actionable.
[11:03:07] !log ladsgroup@terbium:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki Gautehuus Neuraxıs
[11:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:04:41] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:04:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:06:14] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[11:07:12] (PS5) Urbanecm: Initial configuration for olo.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[11:08:17] (CR) Urbanecm: "I've added path to HD logos in PS5 because I have no idea why HD path should be in different commit than normal path."
[mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[11:19:45] (PS1) Urbanecm: Add 1.5 and 2x logos for olowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/313658 (https://phabricator.wikimedia.org/T146745)
[11:25:37] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:26:33] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:26:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:36:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[11:37:56] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[11:50:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:55:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:11:05] PROBLEM - Disk space on serpens is CRITICAL: DISK CRITICAL - free space: / 658 MB (3% inode=94%)
[12:24:57] and I'm gone again for several hours, reachable as usual by phone. sorry but we have to be at the place early...
[12:27:15] there are a lot of files /var/lib/ldap/labs/log.* (13G worth) on serpens, I don't know if some/most can be tossed or not
[12:27:15] if so that would take care of the space issue
[12:27:34] they go all the way back to dec 2015
[12:27:36] gone...
[12:43:58] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[12:47:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[12:56:09] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:57:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:04:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 641 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3428363 keys - replication_delay is 641
[14:21:31] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2681677 (Multichill) As this one is still open, I'll just continue in this bug. Since a couple of minutes...
[14:21:43] Operations, DNS, Phabricator, Traffic, Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2681679 (Aklapper) >>! In T137252#2362375, @Dzahn wrote: > Maybe we should get some actual numbers for these things instead of g...
[14:37:43] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2681682 (hoo) We had a few thousand queries that were `TOOL: auxiliary_matcher`, those caused syntax error...
[14:43:20] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3421687 keys - replication_delay is 0
[15:10:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 683 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3422618 keys - replication_delay is 683
[15:14:30] Operations, DNS, Phabricator, Traffic, Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2681698 (Krenair) That doesn't resolve to the Wikimedia servers, so not at the moment. Could add it in DNS and I expect it'd ser...
[15:17:40] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3419080 keys - replication_delay is 0
[15:55:31] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:09:52] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (Magnus) Oops, forgot the # Fixed now.
[16:14:22] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:21:14] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2681741 (jcrespo) By looking at https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?panelId=...
[16:23:10] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:39:32] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:40:55] Operations, DBA: Add icinga check for all MySQL/MariaDB hosts to check they have the right read_only value - https://phabricator.wikimedia.org/T111766#2681769 (jcrespo) Open>Resolved Sadly, those, except dbstore1001, are exceptions to the rule: slaves that are also masters due to analytics and la...
[16:50:07] Operations, DBA: Add icinga check for all MySQL/MariaDB hosts to check they have the right read_only value - https://phabricator.wikimedia.org/T111766#2681777 (jcrespo) @fgiunchedi the functionality itself is cool, and I will use it to check other things, though. For example, soft alerts on threads_conne...
[16:51:16] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[16:51:42] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active
[16:52:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[16:59:11] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[17:02:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:03:26] Operations, Beta-Cluster-Infrastructure, Traffic: Logo for Beta Labs Meta-Wiki is broken - https://phabricator.wikimedia.org/T147116#2681787 (AlexMonk-WMF) varnish backend is failing to start due to: ```Unknown variable 'beresp.storage_hint' At: ('upload-backend.inc.vcl' Line 8 Pos 21)...
[17:03:39] Operations, Beta-Cluster-Infrastructure, Traffic: Upload cache in beta is broken - https://phabricator.wikimedia.org/T147116#2681790 (AlexMonk-WMF)
[17:07:24] Operations, Beta-Cluster-Infrastructure, Traffic: Upload cache in beta is broken - https://phabricator.wikimedia.org/T147116#2681791 (MZMcBride) Adding @BBlack and @ema as subscribers.
[17:07:57] (PS1) Alex Monk: varnish: Fix upload backend support for versions other than 4 [puppet] - https://gerrit.wikimedia.org/r/313668 (https://phabricator.wikimedia.org/T147116)
[17:09:33] (CR) Alex Monk: "Follows up I024b54ed and I9aef4cb9. Cherry-picked on deployment-puppetmaster, where it seems to have fixed things. I'm not 100% sure about" [puppet] - https://gerrit.wikimedia.org/r/313668 (https://phabricator.wikimedia.org/T147116) (owner: Alex Monk)
[17:12:31] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3424385 keys - replication_delay is 650
[17:35:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3408417 keys - replication_delay is 4
[17:46:00] (PS1) Muehlenhoff: Profile firejail containment for ghostscript [puppet] - https://gerrit.wikimedia.org/r/313669
[17:56:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:05:53] (PS26) Alex Monk: Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450)
[18:06:53] (CR) jenkins-bot: [V: -1] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[18:09:55] (PS27) Alex Monk: Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607
(https://phabricator.wikimedia.org/T138450)
[18:18:55] Heya
[18:19:11] This is yuvib
[18:20:37] yuvi?
[19:28:23] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:53:23] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[20:16:52] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:17:04] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:21:52] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:31:46] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[20:34:27] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[20:39:47] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[20:47:10] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:51:52] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:54:24] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[21:11:57] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[21:11:57] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[21:12:17] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[21:57:05] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:09:58] RECOVERY - cassandra service on maps-test2002 is OK: OK - cassandra is active
[22:17:20] PROBLEM - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[22:22:07] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[22:40:00] RECOVERY - cassandra service on maps-test2002 is OK: OK - cassandra is active
[22:47:23] PROBLEM - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[22:59:47] RECOVERY - cassandra service on maps-test2004 is OK: OK - cassandra is active
[23:07:33] PROBLEM - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[23:26:16] (PS28) Alex Monk: Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450)
[23:40:10] RECOVERY - cassandra service on maps-test2002 is OK: OK - cassandra is active
[23:47:31] PROBLEM - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed