[01:02:45] (CR) Reedy: [C: -1] "CR -1 to prevent accidental merging because of dependency on I011db0e9a2d9da825cf3ac02bfba23b562e052f6 (in operations/puppet repo)" [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T171372) (owner: Ebe123)
[01:05:28] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:06:48] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:06:58] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:14] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[01:07:15] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[01:07:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:15] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:16] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:16] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:27] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:07:37] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:37] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:08:17] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[01:09:18] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:09:27] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy
[01:09:27] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy
[01:09:37] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[01:09:38] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[01:09:57] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[01:10:18] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[01:10:27] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[01:11:17] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:11:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy
[01:13:44] Operations, Domains, Traffic, Wikimedia Resource Center, Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3497968 (Reedy) I don't disagree with Timo above, and I'm guessing #operations will agree. It then fixes the "issues" that arise...
[01:15:17] Reedy: I don't personally object to wm.o/resources but it's not up to me
[01:15:29] Who is it upto? :)
[01:16:00] Ofcourse, you can ask for foobar.wikimedia.org, but if ops say no, it's a no ;)
[01:22:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[01:24:34] harej if i may, I think going the "slash" route is more convential and more usual, and not to mention if for some reason someone needs to setup a dev-related thing under resources.wikimedia.org we arent jumping through hoops to find a different subdomain
[01:26:37] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[01:28:05] !log conf2002:~# service etcdmirror-conftool-eqiad-wmnet restart (not sure what else to do the service failed)
[01:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:35] Zppix: What's dev related got to do with it?
[01:31:55] Chances are, it's only gonna be used for Wikimania, so can probably be scrapped at some point in the future anyway
[01:31:56] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504048 (chasemp)
[01:32:05] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504061 (chasemp) p:Triage>Normal
[01:33:47] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:34:07] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[01:34:47] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[01:36:04] Reedy idk the meta page looks like it could be long term potientally, but i dont know i just read the task to find out what that was about
[01:36:12] (PS2) Ebe123: Run Lilypond from Firejail [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T171372)
[01:36:28] (PS2) Ebe123: Run Lilypond from Firejail [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T171372)
[01:36:34] and i was just relaying the concerns I had that was also mentioned on the task
[01:37:54] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504064 (chasemp) Seems like this is dying really soon post restart ```root@conf2002:~# service etcdmirror-conftool-eqiad-wmnet status ● etcdmirror-conftool-eqiad-wmnet.service - Etcd mirrormaker Loaded:...
[01:38:03] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504065 (chasemp) p:Normal>Unbreak!
[01:39:43] the shortcut would be used beyond wikimania. in any case, as i said i am fine with wikimedia.org/resources but I need to ask others first
[01:40:40] Operations, Domains, Traffic, Wikimedia Resource Center, Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3504068 (Reedy) The other question is whether you're going to be wanting more redirects for other similar purposes... In which c...
[01:41:16] Reedy: what do you mean by the education subdomain?
[01:41:26] harej i'd assume wikiedu
[01:41:30] https://github.com/wikimedia/puppet/blob/7ac5f9a959924a6b51625b713e95c44ed7560ee8/modules/mediawiki/files/apache/sites/redirects/redirects.dat#L112-L140
[01:41:49] huh, interesting
[01:42:03] I don't mean you using the education subdomain, ofc
[01:42:09] but that pattern may make more sense than
[01:42:16] wikimedia.org/resourcesfoobar
[01:42:19] wikimedia.org/resourcesfoobaz
[01:42:22] wikimedia.org/resourcesbarfoo
[01:42:34] so, resources.wikimedia.org/contributors, stuff like that?
[01:42:46] potentially, yup
[01:42:48] that sounds interesting
[01:43:08] ofc, that detracts from Timos originally point/suggestion...
[01:43:25] But if it's more than a single use... It may have merit doing so
[01:44:34] Thing is, the sub-categorization is liable to change
[01:44:35] The current four-audience setup isn't going to stay forever
[01:44:40] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504071 (chasemp) I think the relevant portion is probably "...or if the lag is large enough that we're losing etcd events"
[02:01:27] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:02:27] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set
[02:34:07] PROBLEM - MD RAID on labtestservices2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:07] RECOVERY - MD RAID on labtestservices2003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[02:47:17] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[02:47:27] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[02:47:50] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.074 second response time
[02:54:17] (PS3) Ebe123: Run Lilypond from Firejail [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582)
[02:58:10] (PS3) Ebe123: Run Lilypond from Firejail [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T172582)
[03:08:07] PROBLEM - Check whether ferm is active by checking the default input chain on kubestage1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:09:07] RECOVERY - Check whether ferm is active by checking the default input chain on kubestage1002 is OK: OK ferm input default policy is set
[03:12:08] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:13:07] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set
[03:25:57] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 817.38 seconds
[03:54:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 208.11 seconds
[04:56:17] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[04:56:59] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3841 bytes in 0.022 second response time
[05:50:07] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:17] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:27] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:27] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:28] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:28] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:37] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:37] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:37] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:51:37] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[05:52:37] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:53:18] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[05:54:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:54:39] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy
[05:55:37] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[05:55:37] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[05:57:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[05:58:02] Operations, Wikimedia-Site-requests, I18n, Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#3504125 (Zoranzoki21)
[05:58:37] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:58:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:59:37] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:02:47] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[06:03:47] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy
[06:04:57] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[06:05:27] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[06:05:48] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:06:47] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:07:48] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:08:27] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:08:47] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[06:08:57] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[06:09:57] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[06:09:58] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy
[06:10:47] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[06:11:07] (PS1) EBernhardson: Decrease size of cirrussearch pool counters to reduce load during spikes [mediawiki-config] - https://gerrit.wikimedia.org/r/370362
[06:11:22] (CR) EBernhardson: [C: 2] Decrease size of cirrussearch pool counters to reduce load during spikes [mediawiki-config] - https://gerrit.wikimedia.org/r/370362 (owner: EBernhardson)
[06:11:48] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:11:57] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:12:52] (Merged) jenkins-bot: Decrease size of cirrussearch pool counters to reduce load during spikes [mediawiki-config] - https://gerrit.wikimedia.org/r/370362 (owner: EBernhardson)
[06:12:57] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:12:58] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:13:11] (CR) jenkins-bot: Decrease size of cirrussearch pool counters to reduce load during spikes [mediawiki-config] - https://gerrit.wikimedia.org/r/370362 (owner: EBernhardson)
[06:14:37] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[06:15:07] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[06:15:57] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:17:07] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[06:17:37] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:18:07] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:18:07] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy
[06:18:08] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[06:18:08] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[06:18:34] !log ebernhardson@tin Synchronized wmf-config/PoolCounterSettings.php: T169498: Reduce cirrus search pool counter to 200 parallel requests cluster wide (duration: 02m 54s)
[06:18:38] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[06:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:47] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498
[06:19:58] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[06:20:07] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy
[06:20:08] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy
[06:22:07] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[06:22:09] oddly reducing the pool size from 380 to 200 didn't increase the number of rejected searches...
[06:22:17] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[06:27:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[06:32:38] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[08:38:02] (PS1) Revi: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630)
[08:44:06] (PS2) Revi: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630)
[10:58:38] Operations, ops-eqiad, Analytics: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3504238 (elukey)
[11:07:06] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504266 (elukey) p:Unbreak!>High
[13:04:07] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[13:05:59] seems already fixed --^
[13:06:06] peak of 502s for upload
[13:08:02] seems all */thumb/* so I'll Cc: godog just in case :)
[13:11:07] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time
[13:12:17] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:13:10] !log restart pdfrender on scb1002
[13:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:13] !log powercycle mw2256 - com2 frozen - T163346
[13:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:24] T163346: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346
[13:19:07] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[13:20:17] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3504378 (elukey) @Papaul the host keep getting in a frozen state, we'd need to re-check what's wrong :(
[15:06:57] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2056889
[16:46:57] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 1
[17:46:18] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3504660 (fgiunchedi) p:Low>High Since thumbor is in production now I'm bumping the priority becaus...
[17:46:38] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3504664 (fgiunchedi)
[19:21:28] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[19:21:38] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:22:28] RECOVERY - Nginx local proxy to apache on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.136 second response time
[19:22:37] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.045 second response time
[21:49:57] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 356.64 seconds
[21:50:07] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.76 seconds
[21:50:07] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.85 seconds
[21:50:08] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 365.36 seconds
[21:50:28] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 378.78 seconds
[21:52:57] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 9.49 seconds
[21:53:07] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 13.53 seconds
[21:53:08] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 14.74 seconds
[21:53:08] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 16.14 seconds
[21:53:37] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds