[01:03:21] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:41] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:41] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:41] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:51] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:01] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:17:11] PROBLEM - Host cloudnet1003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:18:51] RECOVERY - Host cloudnet1003 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[02:06:31] PROBLEM - Check health of redis instance on 6382 on rdb1004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6382
[02:06:42] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:06:51] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:06:51] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:11] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:41] RECOVERY - Check health of redis instance on 6382 on rdb1004 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6382 has 1 databases (db0) with 5767639 keys, up 60 days 51 minutes
[02:08:02] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:08:42] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:45:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:56:31] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
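The recurring "X% of data above the critical threshold" alerts above come from a Graphite-backed check: it pulls the last few datapoints of an error-rate metric and goes CRITICAL once more than a configured percentage of them exceed the threshold (here 50 errors/minute, with a lower recovery threshold). A minimal sketch of that kind of logic, using a hypothetical Graphite endpoint, metric name and percentages rather than the production check:

```python
import json
import urllib.request

# Hypothetical parameters; the production check and metric names differ.
GRAPHITE = "https://graphite.example.org"
METRIC = "MediaWiki.errors.fatal.rate"
CRITICAL_THRESHOLD = 50.0   # errors per minute, as in "[50.0]" above
CRITICAL_PERCENT = 70.0     # alert when this % of datapoints are over threshold

def fetch_datapoints(metric, minutes=10):
    """Fetch recent datapoints via Graphite's render API (JSON format)."""
    url = f"{GRAPHITE}/render?target={metric}&from=-{minutes}min&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    if not series:
        return []
    # Each series is {"target": ..., "datapoints": [[value, timestamp], ...]};
    # values can be null for missing samples, so drop those.
    return [v for v, _ts in series[0]["datapoints"] if v is not None]

def check(metric):
    points = fetch_datapoints(metric)
    if not points:
        return "UNKNOWN: no data for " + metric
    over = sum(1 for v in points if v > CRITICAL_THRESHOLD)
    pct = 100.0 * over / len(points)
    if pct >= CRITICAL_PERCENT:
        return f"CRITICAL: {pct:.2f}% of data above the critical threshold [{CRITICAL_THRESHOLD}]"
    return f"OK: Less than {CRITICAL_PERCENT:.2f}% above the threshold [{CRITICAL_THRESHOLD}]"

print(check(METRIC))
```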
[03:26:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 824.67 seconds
[03:26:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:30:52] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:46:11] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:48:21] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:52:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.57 seconds
[04:17:14] (PS12) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[04:18:10] (CR) Mathew.onipe: Elasticsearch module is coming up. (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[04:18:41] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[05:09:05] Operations, Traffic, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Krenair) I'm CCing Moritz in case he has any advice or other ideas for the below. Basically what we've got is one backend process doing...
[06:17:22] PROBLEM - HHVM rendering on mw2273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:18:21] RECOVERY - HHVM rendering on mw2273 is OK: HTTP OK: HTTP/1.1 200 OK - 74405 bytes in 0.454 second response time
[06:29:01] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:29:12] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/reboot-host],File[/usr/local/sbin/snapshot-manager]
[06:32:13] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:32:31] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/00-nonexistent.conf]
[06:57:42] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:37] (PS1) Odder: Add high-density logos for the Russian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343)
[06:59:31] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:42] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:58:20] (PS1) Odder: Add high-density logos for Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/457084 (https://phabricator.wikimedia.org/T203342)
[09:29:31] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.423 second response time
[09:39:32] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:42:51] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.714 second response time
[09:49:41] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:52:41] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:55:01] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.452 second response time
[09:57:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:58:21] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:10:12] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:10:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.808 second response time
[10:11:46] Hi, why does gerrit add me every time as a reviewer of a patch?
[10:12:01] For the pywikibot/core repository
[10:13:52] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:15:41] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.018 second response time
[10:17:28] zoranzoki21 you are listed in https://www.mediawiki.org/wiki/Git/Reviewers#pywikibot/core
[10:18:46] Paladox: Oh, thanks.. I removed myself from that page
[10:18:51] ok
[10:19:01] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:30:31] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.571 second response time
[10:33:52] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:38:01] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.628 second response time
[10:41:21] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:50:19] hmm i get "You don't have permission to access /mailman/subscribe/cloud-announce on this server." on https://lists.wikimedia.org/mailman/subscribe/cloud-announce
[10:58:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:01:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:01:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.887 second response time
[11:05:02] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:07:52] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.859 second response time
[11:08:21] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.566 second response time
[11:11:21] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:11:42] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:12:51] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.051 second response time
[11:16:12] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:28:14] (CR) Gehel: "Great! Everything looks perfect! Time to port the `next_nodes()` function from the estool project and start adding tests" [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[11:29:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.778 second response time
[11:32:51] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:43:01] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 49.25, 34.65, 24.86
[11:45:12] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.84, 38.24, 27.42
[11:56:11] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.25, 18.61, 23.18
[11:58:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.097 second response time
[12:01:41] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:16:51] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.035 second response time
[12:20:12] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:25:41] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.651 second response time
[12:29:11] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:45:32] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[12:47:51] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:16:42] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.504 second response time
[13:20:02] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:53:12] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.368 second response time
[13:56:41] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:19:32] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:20:11] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:20:22] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled
[14:20:31] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled
[14:21:11] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:21:32] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[14:21:41] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy
[14:24:41] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:24:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:25:01] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled
[14:25:02] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled
[14:27:51] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:31:21] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:32:47] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:33:48] indeed pdfrender is sick, I'll start taking a look
[14:34:27] hey, I just saw the page, need a hand?
[14:35:01] sure, I don't know yet what's wrong besides pdfrender is unhappy
[14:35:14] scb's pdfrender service seems down everywhere
[14:35:57] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:36:21] hey
[14:36:46] indeed, I'm looking at pdfrender's syslog.log on scb1001 and nothing interesting so far
[14:36:57] hey paravoid
[14:37:21] it seems up again
[14:37:31] but I only see request logs, no others
[14:37:35] that is too verbose
[14:38:06] it needs to output only warnings or errors
[14:38:32] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:39:37] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:08] systemd says up since 4 hours ago, so maybe overload?
[14:41:38] perhaps, I was checking if something changed in pdfrender recently
[14:42:03] I didn't see anything on puppet
[14:42:13] maybe a deploy?
[14:43:41] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:43:57] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.006 second response time
[14:44:11] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[14:44:11] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy
[14:44:52] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:45:08] doesn't look like there were changes, https://github.com/wikimedia/mediawiki-services-electron-render
[14:45:12] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[14:45:39] what about mediawiki enables?
[14:47:07] only thing I see on grafana is a spike at https://grafana.wikimedia.org/dashboard/db/mediawiki-electronpdfservice?orgId=1&panelId=9&fullscreen&from=1535888935781&to=1535894886989
[14:47:29] but that was >2h ago
[14:48:34] looks like pdfrender has been restarted now?
[14:48:45] on which host?
[14:48:47] it's been flapping for over five hours it seems
[14:48:59] oh, I see it just restarted on 1002
[14:49:04] I have not restarted it anywhere, still looking for any indication of what went wrong
[14:49:05] but it wasn't me
[14:50:42] yes, I see it restarted on 1003 about 7 mins ago
[14:51:02] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time
[14:51:50] High load, I guess? https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=scb1002&var-datasource=eqiad%20prometheus%2Fops&from=1535878841931&to=1535903373000
[14:51:58] anyways it now seems to be serving things properly over there
[14:52:30] so this will need more debugging, but that can wait
[14:54:19] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All
[14:54:35] shows 100% disk utilization where normally there was barely any
[14:56:00] there is https://phabricator.wikimedia.org/T174916
[14:56:15] so probably that is not as stable as it could be
[14:56:29] I will comment there and will ping services tomorrow
[14:56:42] sounds good
[14:57:15] !log restart pdfrender across scb100* cluster
[14:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:38] sorry, I ended up in wikimedia-overflow and the -o led me to believe I was in operations
[14:57:47] oh, so that was you! heh
[14:57:58] anyway I've restarted them
[14:57:59] I was looking around at crontabs and in puppet to see if it somehow autorestarted :-P
[14:58:01] thanks
[14:58:21] scb needs 2 restarts per that usual bug with Xpra or whatever
[14:58:27] scb1004 needed*
[14:58:37] thanks, akosiaris, we didn't know if it was someone
[14:58:40] or systemd
[14:59:21] but we had seen it had restarted
[14:59:23] you were probably telling us all what you were doing and everything... ah well
[14:59:32] ah! thanks akosiaris
[15:02:09] Operations, Electron-PDFs, Readers-Web-Backlog (Tracking), Services (blocked): electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 (jcrespo) There was instability on most or all scb1* pdfrender services from 16:32 to 16:45, stopped after a manual restart.
[15:02:45] I left a note on https://phabricator.wikimedia.org/T174916#4551785, if it is the wrong one at least it is minimally documented
[15:03:03] that's fine, thanks
[15:03:08] yup, thanks
[15:03:09] I didn't see anything better in phab
[15:03:27] I see other alerts due to performance
[15:03:30] pdfrender is supposed to be replaced by proton anyway
[15:03:52] but they are like 0.1 second regressions, so performance domain, not a threat to availability
[15:04:20] akosiaris: I am very confused with all the replacements, and replacements of replacements :-P
[15:04:36] jynus: name them all OCG and just start incrementing numbers :P
[15:04:41] lol
[15:04:45] this must be ocg v4
[15:04:47] anyway, have to go, thanks for attending
[15:04:52] byee
[15:04:56] bye
[15:11:20] (PS1) Andrew Bogott: designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093
[15:14:19] (PS2) Andrew Bogott: designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093
[15:19:40] (PS3) Andrew Bogott: designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093
[15:20:39] Operations: syncing Ubuntu mirror fail - https://phabricator.wikimedia.org/T203290 (Dzahn) talked more on #canonical-sysadmin, they mentioned #ubuntu-mirrors. also reported there. then sent mail to rt@ubuntu now.
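The pdfrender flapping above was cleared by restarting the service on each scb host, with a second restart where the first one did not take (the known hang tracked in T174916). A rough sketch of that manual procedure, assuming SSH access with sudo rights, a systemd unit named pdfrender, and that the service answers plain HTTP on port 5252 (the pool name pdfrender_5252 suggests so); in practice this would go through the usual orchestration tooling rather than a hand-rolled loop:

```python
import subprocess
import urllib.request

# Host list from the incident above; adjust as needed.
SCB_HOSTS = ["scb1001.eqiad.wmnet", "scb1002.eqiad.wmnet",
             "scb1003.eqiad.wmnet", "scb1004.eqiad.wmnet"]

def pdfrender_healthy(host, port=5252, timeout=10):
    """Probe the pdfrender HTTP endpoint, mirroring the Icinga/PyBal HTTP check."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False

def restart_pdfrender(host):
    """Restart the systemd unit over SSH (assumes sudo rights on the host)."""
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "pdfrender"],
                   check=True)

for host in SCB_HOSTS:
    if pdfrender_healthy(host):
        continue
    # First restart; per the discussion above, a host sometimes needs a
    # second one before the service answers again (cf. T174916).
    restart_pdfrender(host)
    if not pdfrender_healthy(host):
        restart_pdfrender(host)
    print(host, "healthy" if pdfrender_healthy(host) else "still failing")
```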
[15:23:58] (PS4) Andrew Bogott: designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093
[15:25:34] Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (Dzahn) in LDAP: removed him from the 'ops' group. I meant to add him to "wmf" instead but he...
[15:31:23] (CR) Andrew Bogott: [C: 2] designate: install memcached for coordination backend [puppet] - https://gerrit.wikimedia.org/r/457093 (owner: Andrew Bogott)
[16:13:03] (CR) Putnik: [C: 1] "The logo and text look good, I don't see any problems." [mediawiki-config] - https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: Odder)
[21:21:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[21:25:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[22:01:05] PROBLEM - MariaDB Slave Lag: s8 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 380.99 seconds
[22:06:44] RECOVERY - MariaDB Slave Lag: s8 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 43.92 seconds
[22:07:35] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 764.78 seconds
[22:16:25] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 199.14 seconds
[22:26:34] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:26:35] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:26:45] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:04] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:04] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:14] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:24] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:27:25] PROBLEM - SSH on stat1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:29:24] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:29:35] RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[22:49:04] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[22:54:45] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:00:54] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[23:01:04] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[23:01:15] RECOVERY - Disk space on stat1005 is OK: DISK OK
[23:01:15] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[23:01:44] RECOVERY - DPKG on stat1005 is OK: All packages OK
[23:01:54] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[23:02:55] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:13:45] anyone know why sometimes icinga just shows 'return code of 255 is out of bounds' for a bunch of checks on hosts from time to time?
[23:18:28] software sucks
[23:19:14] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Sun 2018-09-02 23:19:04 UTC.
[23:21:28] is that return code something set on the check?
[23:43:22] Reedy: lol
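On the "return code of 255 is out of bounds" question at the end: Icinga derives a check's state from the plugin's exit status, where 0, 1, 2 and 3 mean OK, WARNING, CRITICAL and UNKNOWN, and anything outside that range is reported as "out of bounds". An exit status of 255 usually means the check command itself failed to run, for example when NRPE or SSH cannot reach the host, which fits the stat1005 burst above (SSH was timing out at the same moment). A minimal sketch of the convention, using a hypothetical disk-space plugin rather than one of the production checks:

```python
import shutil
import sys

# Nagios/Icinga plugin exit-code convention: the monitoring server maps the
# plugin's exit status to a check state; anything outside 0-3 is shown as
# "Return code of X is out of bounds".
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk_free_percent(path="/", warn=20.0, crit=10.0):
    """Hypothetical check: free disk space on `path`, thresholds in percent."""
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    if free_pct < crit:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free on {path}"
    if free_pct < warn:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free on {path}"
    return OK, f"DISK OK - {free_pct:.1f}% free on {path}"

if __name__ == "__main__":
    code, message = check_disk_free_percent()
    print(message)   # first line of output becomes the status text in Icinga
    # If this script never got to run at all (broken transport to the host,
    # missing interpreter, ...), the caller typically observes exit status 255
    # instead of one of the values below, hence the "out of bounds" alerts.
    sys.exit(code)   # exit status becomes the check state
```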