[01:03:21] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:41] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:41] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:41] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:51] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:01] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:17:11] PROBLEM - Host cloudnet1003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:18:51] RECOVERY - Host cloudnet1003 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[02:06:31] PROBLEM - Check health of redis instance on 6382 on rdb1004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6382
[02:06:42] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:06:51] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:06:51] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:11] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:41] RECOVERY - Check health of redis instance on 6382 on rdb1004 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6382 has 1 databases (db0) with 5767639 keys, up 60 days 51 minutes
[02:08:02] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:08:42] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:45:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:56:31] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
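The recurring "X% of data above the critical threshold" alerts above come from a Graphite-backed check: it pulls the last few datapoints of an error-rate metric and goes CRITICAL once more than a configured percentage of them exceed the threshold (here 50 errors/minute, with a lower recovery threshold). A minimal sketch of that kind of logic, using a hypothetical Graphite endpoint, metric name and percentages rather than the production check:

```python
import json
import urllib.request

# Hypothetical parameters; the production check and metric names differ.
GRAPHITE = "https://graphite.example.org"
METRIC = "MediaWiki.errors.fatal.rate"
CRITICAL_THRESHOLD = 50.0   # errors per minute, as in "[50.0]" above
CRITICAL_PERCENT = 70.0     # alert when this % of datapoints are over threshold

def fetch_datapoints(metric, minutes=10):
    """Fetch recent datapoints via Graphite's render API (JSON format)."""
    url = f"{GRAPHITE}/render?target={metric}&from=-{minutes}min&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    if not series:
        return []
    # Each series is {"target": ..., "datapoints": [[value, timestamp], ...]};
    # values can be null for missing samples, so drop those.
    return [v for v, _ts in series[0]["datapoints"] if v is not None]

def check(metric):
    points = fetch_datapoints(metric)
    if not points:
        return "UNKNOWN: no data for " + metric
    over = sum(1 for v in points if v > CRITICAL_THRESHOLD)
    pct = 100.0 * over / len(points)
    if pct >= CRITICAL_PERCENT:
        return f"CRITICAL: {pct:.2f}% of data above the critical threshold [{CRITICAL_THRESHOLD}]"
    return f"OK: Less than {CRITICAL_PERCENT:.2f}% above the threshold [{CRITICAL_THRESHOLD}]"

print(check(METRIC))
```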
[03:26:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 824.67 seconds
[03:26:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:30:52] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:46:11] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:48:21] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:52:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.57 seconds
[04:17:14] (PS12) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[04:18:10] (CR) Mathew.onipe: Elasticsearch module is coming up. (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[04:18:41] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[05:09:05] Operations, Traffic, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Krenair) I'm CCing Moritz in case he has any advice or other ideas for the below. Basically what we've got is one backend process doing...
[06:17:22] PROBLEM - HHVM rendering on mw2273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:18:21] RECOVERY - HHVM rendering on mw2273 is OK: HTTP OK: HTTP/1.1 200 OK - 74405 bytes in 0.454 second response time
[06:29:01] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:29:12] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/reboot-host],File[/usr/local/sbin/snapshot-manager]
[06:32:13] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:32:31] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/00-nonexistent.conf]
[06:57:42] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:37] (PS1) Odder: Add high-density logos for the Russian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343)
[06:59:31] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:42] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:58:20] (PS1) Odder: Add high-density logos for Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/457084 (https://phabricator.wikimedia.org/T203342)
[09:29:31] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.423 second response time
[09:39:32] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:42:51] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.714 second response time
[09:49:41] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:52:41] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:55:01] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.452 second response time
[09:57:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:58:21] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:10:12] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:10:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.808 second response time
[10:11:46] Hi, why does gerrit add me every time as a reviewer of a patch?
[10:12:01] For the pywikibot/core repository
[10:13:52] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:15:41] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.018 second response time
[10:17:28] zoranzoki21 you are listed in https://www.mediawiki.org/wiki/Git/Reviewers#pywikibot/core
[10:18:46] Paladox: Oh, thanks.. I removed myself from that page
[10:18:51] ok
[10:19:01] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:30:31] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.571 second response time
[10:33:52] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:38:01] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.628 second response time
[10:41:21] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:50:19] hmm i get "You don't have permission to access /mailman/subscribe/cloud-announce on this server." on https://lists.wikimedia.org/mailman/subscribe/cloud-announce
[10:58:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:01:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:01:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.887 second response time
[11:05:02] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:07:52] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.859 second response time
[11:08:21] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.566 second response time
[11:11:21] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:11:42] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:12:51] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.051 second response time
[11:16:12] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:28:14] (CR) Gehel: "Great! Everything looks perfect! Time to port the `next_nodes()` function from the estool project and start adding tests" [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[11:29:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.778 second response time
[11:32:51] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:43:01] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 49.25, 34.65, 24.86
[11:45:12] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.84, 38.24, 27.42
[11:56:11] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.25, 18.61, 23.18
[11:58:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.097 second response time
[12:01:41] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:16:51] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.035 second response time
[12:20:12] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:25:41] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.651 second response time
[12:29:11] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:45:32] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[12:47:51] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:16:42] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.504 second response time
[13:20:02] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:53:12] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.368 second response time
[13:56:41] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:19:32] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:20:11] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:20:22] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled
[14:20:31] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled
[14:21:11] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:21:32] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[14:21:41] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy
[14:24:41] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:24:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:25:01] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled
[14:25:02] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled
[14:27:51] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:31:21] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:32:47] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:33:48] indeed pdfrender is sick, I'll start taking a look
[14:34:27] hey, I just saw the page, need a hand?
[14:35:01] sure, I don't know yet what's wrong besides pdfrender is unhappy
[14:35:14] scb's pdfrender service seems down everywhere
[14:35:57] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:36:21] hey
[14:36:46] indeed, I'm looking at pdfrender's syslog.log on scb1001 and nothing interesting so far
[14:36:57] hey paravoid
[14:37:21] it seems up again
[14:37:31] but I only see request logs, no others
[14:37:35] that is too verbose
[14:38:06] it needs to output only warnings or errors
[14:38:32] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:39:37] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:08] systemd says up since 4 hours ago, so maybe overload?
[14:41:38] perhaps, I was checking if something changed in pdfrender recently
[14:42:03] I didn't see anything on puppet
[14:42:13] maybe a deploy?
[14:43:41] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:43:57] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.006 second response time
[14:44:11] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[14:44:11] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy
[14:44:52] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:45:08] doesn't look like there were changes, https://github.com/wikimedia/mediawiki-services-electron-render
[14:45:12] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[14:45:39] what about mediawiki enables?
[14:47:07] only thing I see on grafana is a spike at https://grafana.wikimedia.org/dashboard/db/mediawiki-electronpdfservice?orgId=1&panelId=9&fullscreen&from=1535888935781&to=1535894886989
[14:47:29] but that was >2h ago
[14:48:34] looks like pdfrender has been restarted now?
[14:48:45] on which host?
[14:48:47] it's been flapping for over five hours it seems
[14:48:59] oh, I see it just restarted on 1002
[14:49:04] I have not restarted it anywhere, still looking for any indication of what went wrong
[14:49:05] but it wasn't me
[14:50:42] yes, I see it restarted on 1003 about 7 mins ago
[14:51:02] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time
[14:51:50] High load, I guess? https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=scb1002&var-datasource=eqiad%20prometheus%2Fops&from=1535878841931&to=1535903373000
[14:51:58] anyways it now seems to be serving things properly over there
[14:52:30] so this will need more debugging, but that can wait
[14:54:19] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All
[14:54:35] shows 100% disk utilization where normally there was barely any
[14:56:00] there is https://phabricator.wikimedia.org/T174916
[14:56:15] so probably that is not as stable as it could be
[14:56:29] I will comment there and will ping services tomorrow
[14:56:42] sounds good
[14:57:15] !log restart pdfrender across scb100* cluster
[14:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:38] sorry, I ended up in wikimedia-overflow and the -o led me to believe I was in operations
[14:57:47] oh, so that was you! heh
[14:57:58] anyway I've restarted them
[14:57:59] I was looking around at crontabs and in puppet to see if it somehow autorestarted :-P
[14:58:01] thanks
[14:58:21] scb needs 2 restarts per that usual bug with Xpra or whatever
[14:58:27] scb1004 needed*
[14:58:37] thanks, akosiaris, we didn't know if it was someone
[14:58:40] or systemd
[14:59:21] but we had seen it had restarted
[14:59:23] you were probably telling us all what you were doing and everything... ah well
[14:59:32] ah! thanks akosiaris
[15:02:09] Operations, Electron-PDFs, Readers-Web-Backlog (Tracking), Services (blocked): electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 (jcrespo) There was instability on most or all scb1* pdfrender services from 16:32 to 16:45, stopped after a manual restart.
[15:02:45] I left a note on https://phabricator.wikimedia.org/T174916#4551785, if it is the wrong one at least it is minimally documented
[15:03:03] that's fine, thanks
[15:03:08] yup, thanks
[15:03:09] I didn't see anything better in phab
[15:03:27] I see other alerts due to performance
[15:03:30] pdfrender is supposed to be replaced by proton anyway
[15:03:52] but they are like 0.1 second regressions, so performance domain, not a threat to availability
[15:04:20] akosiaris: I am very confused with all the replacements, and replacements of replacements :-P
[15:04:36] jynus: name them all OCG and just start incrementing numbers :P
[15:04:41] lol
[15:04:45] this must be ocg v4
[15:04:47] anyway, have to go, thanks for attending
[15:04:52] byee
[15:04:56] bye
[15:11:20] (PS1) Andrew Bogott: designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093
[15:14:19] (PS2) Andrew Bogott: designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093
[15:19:40] (PS3) Andrew Bogott: designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093
[15:20:39] Operations: syncing Ubuntu mirror fail - https://phabricator.wikimedia.org/T203290 (Dzahn) talked more on #canonical-sysadmin, they mentioned #ubuntu-mirrors. also reported there. then sent mail to rt@ubuntu now.
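The pdfrender flapping above was cleared by restarting the service on each scb host, with a second restart where the first one did not take (the known hang tracked in T174916). A rough sketch of that manual procedure, assuming SSH access with sudo rights, a systemd unit named pdfrender, and that the service answers plain HTTP on port 5252 (the pool name pdfrender_5252 suggests so); in practice this would go through the usual orchestration tooling rather than a hand-rolled loop:

```python
import subprocess
import urllib.request

# Host list from the incident above; adjust as needed.
SCB_HOSTS = ["scb1001.eqiad.wmnet", "scb1002.eqiad.wmnet",
             "scb1003.eqiad.wmnet", "scb1004.eqiad.wmnet"]

def pdfrender_healthy(host, port=5252, timeout=10):
    """Probe the pdfrender HTTP endpoint, mirroring the Icinga/PyBal HTTP check."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False

def restart_pdfrender(host):
    """Restart the systemd unit over SSH (assumes sudo rights on the host)."""
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "pdfrender"],
                   check=True)

for host in SCB_HOSTS:
    if pdfrender_healthy(host):
        continue
    # First restart; per the discussion above, a host sometimes needs a
    # second one before the service answers again (cf. T174916).
    restart_pdfrender(host)
    if not pdfrender_healthy(host):
        restart_pdfrender(host)
    print(host, "healthy" if pdfrender_healthy(host) else "still failing")
```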
[15:23:58] (PS4) Andrew Bogott: designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093
[15:25:34] Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (Dzahn) in LDAP: removed him from the 'ops' group. I meant to add him to "wmf" instead but he...
[15:31:23] (CR) Andrew Bogott: [C: 2] designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093 (owner: Andrew Bogott)
[16:13:03] (CR) Putnik: [C: 1] "The logo and text look good, I don't see any problems." [mediawiki-config] - https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: Odder)
[21:21:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[21:25:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[22:01:05] PROBLEM - MariaDB Slave Lag: s8 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 380.99 seconds
[22:06:44] RECOVERY - MariaDB Slave Lag: s8 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 43.92 seconds
[22:07:35] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 764.78 seconds
[22:16:25] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 199.14 seconds
[22:26:34] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:26:35] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:26:45] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:04] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:04] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:14] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:24] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:25] PROBLEM - SSH on stat1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:29:24] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:29:35] RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[22:49:04] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:54:45] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:00:54] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[23:01:04] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[23:01:15] RECOVERY - Disk space on stat1005 is OK: DISK OK
[23:01:15] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[23:01:44] RECOVERY - DPKG on stat1005 is OK: All packages OK
[23:01:54] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[23:02:55] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:13:45] anyone know why sometimes icinga just shows 'return code of 255 is out of bounds' for a bunch of checks on hosts from time to time?
[23:18:28] software sucks
[23:19:14] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Sun 2018-09-02 23:19:04 UTC.
[23:21:28] is that return code something set on the check?
[23:43:22] Reedy: lol
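On the "return code of 255 is out of bounds" question at the end: Icinga derives a check's state from the plugin's exit status, where 0, 1, 2 and 3 mean OK, WARNING, CRITICAL and UNKNOWN, and anything outside that range is reported as "out of bounds". An exit status of 255 usually means the check command itself failed to run, for example when NRPE or SSH cannot reach the host, which fits the stat1005 burst above (SSH was timing out at the same moment). A minimal sketch of the convention, using a hypothetical disk-space plugin rather than one of the production checks:

```python
import shutil
import sys

# Nagios/Icinga plugin exit-code convention: the monitoring server maps the
# plugin's exit status to a check state; anything outside 0-3 is shown as
# "Return code of X is out of bounds".
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk_free_percent(path="/", warn=20.0, crit=10.0):
    """Hypothetical check: free disk space on `path`, thresholds in percent."""
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    if free_pct < crit:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free on {path}"
    if free_pct < warn:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free on {path}"
    return OK, f"DISK OK - {free_pct:.1f}% free on {path}"

if __name__ == "__main__":
    code, message = check_disk_free_percent()
    print(message)   # first line of output becomes the status text in Icinga
    # If this script never got to run at all (broken transport to the host,
    # missing interpreter, ...), the caller typically observes exit status 255
    # instead of one of the values below, hence the "out of bounds" alerts.
    sys.exit(code)   # exit status becomes the check state
```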