[00:00:12] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on einsteinium is CRITICAL: 46.8 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:01:03] ^ expected
[00:01:14] "traffic drop" alert is just too heuristic and uninformed
[00:01:43] bblack, ping/page me if needed, I'll keep an eye on my phone, I'm 30min away from my laptop
[00:02:23] XioNoX: ok, thanks!
[00:02:45] in general, our cache hitrate is dropping off a bit with all these fresh caches coming in, which I'm sure is spiking up the reqrates to MW and other backend services a bit
[00:04:22] (03PS5) 10Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485
[00:05:26] cache_upload was ~97% global hitrate before all of this, now bottomed out of the change curve at 92.5% (should only go up from here)
[00:06:18] cache_text is a similar curve, but numbers more like 96.5% dropping to 95%
[00:06:27] (for cacheable traffic)
[00:06:55] so it's not a ton on the text side, shouldn't cause any real problems there
[00:07:22] the upload side is a bit crazy, that's a little more than doubling the miss traffic they see at Swift (and indirectly, thumbor)
[00:07:50] but also should be within the realm of reason, even if it's much more noticeable
[00:09:04] anyways, upload is already back to 93.1% while I was typing all that. It will only slowly get better over time.
[00:10:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:17:32] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:26:25] 94.3% now
[00:26:34] stepping away from the keys!
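(Editor's note: the hit-rate figures above can be sanity-checked directly. A drop from ~97% to 92.5% hitrate means misses rise from ~3% to 7.5% of requests, i.e. 2.5x the origin-bound traffic; the 93.1% recovery point works out to ~2.3x, matching the "little more than doubling" at Swift. A quick illustrative sketch of that arithmetic, not a WMF tool:)

```python
def miss_multiplier(old_hitrate: float, new_hitrate: float) -> float:
    """Factor by which miss (origin-bound) traffic grows when the
    cache hit rate drops, assuming a constant overall request rate."""
    return (1 - new_hitrate) / (1 - old_hitrate)

# cache_upload: ~97% -> 92.5% at the bottom of the curve
print(round(miss_multiplier(0.97, 0.925), 2))  # 2.5x miss traffic to Swift/thumbor
# ...and back at 93.1%: ~2.3x, i.e. "a little more than doubling"
print(round(miss_multiplier(0.97, 0.931), 2))
# cache_text: 96.5% -> 95% is a much gentler increase
print(round(miss_multiplier(0.965, 0.95), 2))
```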
[00:34:12] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:34:31] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on einsteinium is OK: (C)60 le (W)70 le 76.21 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:38:31] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:42:52] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:42:52] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:59:32] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[01:05:23] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:08:41] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:13:21] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:15:24] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:18:27] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10Shahadat) @Jayantanth, I had sent an email via the bn wikivoys email sending system, but you did not reply to my mail with your email address.
[01:20:22] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:39:02] RECOVERY - puppet last run on ganeti1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:39:12] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:39:22] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:49:02] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:50:51] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:04:41] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[02:14:12] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:19:21] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:44:41] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[02:49:51] PROBLEM - puppet last run on wtp1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:57:31] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:20:12] RECOVERY - puppet last run on wtp1025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:27:12] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:27:52] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:30:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 921.83 seconds
[03:32:12] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:45:51] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 202.45 seconds
[03:46:32] PROBLEM - puppet last run on lvs1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:52:41] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:52:41] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:01] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:55:22] PROBLEM - puppet last run on wtp1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:56:51] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Krinkle) The prior conversation at T179212 may be relevant here. The two options discussed so far were 1) Use Gerrit repo with static file server, or 2) Use Phabricator. ##### Ger...
[03:57:32] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[04:05:01] PROBLEM - Device not healthy -SMART- on db1072 is CRITICAL: cluster=mysql device=megaraid,10 instance=db1072:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad%2520prometheus%252Fops
[04:17:01] RECOVERY - puppet last run on lvs1016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:17:51] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:20:41] RECOVERY - puppet last run on wtp1045 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:23:12] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:27:14] 10Operations, 10Wikimedia-Mailing-lists: Request new mail list for Vietnam Wikimedians User Group - https://phabricator.wikimedia.org/T204974 (10minhhuy) Thank you @herron, everything seems perfect :)
[04:45:02] PROBLEM - SSH cp3032.mgmt on cp3032.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:45:01] RECOVERY - SSH cp3032.mgmt on cp3032.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[06:00:47] 10Operations, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Tbayer) >>! In T178802#4641742, @Ottomata wrote: > @HaeB do you still need this? Can we roll this back? Yes, until the end of January it looks like (se...
[06:11:23] (03PS4) 10Urbanecm: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769) (owner: 10MarcoAurelio)
[07:06:12] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:19:11] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:31:52] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:40:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:44:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:49:01] (03PS1) 10Elukey: Apply the same permissions of an1003 to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464940 (https://phabricator.wikimedia.org/T205509)
[07:49:41] (03CR) 10Elukey: [C: 032] Apply the same permissions of an1003 to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464940 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey)
[09:29:07] PROBLEM - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 5 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops
[10:29:12] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Krishna_Chaitanya_Velaga) The old email has been deactivated; please reset the passwords and add kcvelaga@gmail.com as the new admin. Thanks in advance :)
[10:29:22] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Krishna_Chaitanya_Velaga) 05Resolved>03Open
[11:09:07] PROBLEM - High CPU load on API appserver on mw2205 is CRITICAL: CRITICAL - load average: 82.56, 49.97, 26.43
[11:09:37] PROBLEM - HHVM rendering on mw2205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time
[11:09:46] PROBLEM - Apache HTTP on mw2205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[11:09:46] PROBLEM - Nginx local proxy to apache on mw2205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.151 second response time
[11:10:46] RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 75033 bytes in 1.367 second response time
[11:10:47] RECOVERY - Apache HTTP on mw2205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.132 second response time
[11:10:47] RECOVERY - Nginx local proxy to apache on mw2205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.173 second response time
[11:12:26] RECOVERY - High CPU load on API appserver on mw2205 is OK: OK - load average: 14.49, 31.20, 23.33
[11:22:16] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:25:07] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 216 bytes in 0.000 second response time
[11:32:46] (03CR) 10Volans: [C: 04-1] "I think that this is the wrong way to generalize/abstract it." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott)
[11:35:21] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Banyek: Debian package or files managed by puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Volans) >>! In T203674#4646056, @Banyek wrote: > After merging, on one host the pt-kill running in screen should be stopped, then puppet can be...
[12:33:11] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Aklapper) For future reference please file separate tickets for separate requests. This request (mailing list creation) was resolved already in June. The list likely should h...
[12:40:34] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 3 others: Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) Ferm PR: https://github.com...
[12:44:49] (03PS1) 10BBlack: remove FB UA check in our 1h caching for fb_nets [puppet] - 10https://gerrit.wikimedia.org/r/464948
[12:46:29] (03CR) 10BBlack: [C: 032] remove FB UA check in our 1h caching for fb_nets [puppet] - 10https://gerrit.wikimedia.org/r/464948 (owner: 10BBlack)
[12:46:57] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:37] !log depool cp1076, apparently has disk issues
[12:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:04] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Dzahn) I don't think that cloning / committing / pushing is a big hurdle in this case. After all we are talking about Gerrit users, not Phabricator. People who already explicitly co...
[12:50:47] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:56:17] PROBLEM - Check systemd state on cp1076 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:57:00] !log rebooting cp1076
[12:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:16] PROBLEM - Varnish traffic logger - varnishstatsd on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:16] PROBLEM - Varnish traffic logger - varnishmedia on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:17] PROBLEM - Varnish traffic logger - varnishreqstats on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:36] PROBLEM - Varnish HTCP daemon on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:37] PROBLEM - dhclient process on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:56] PROBLEM - Webrequests Varnishkafka log producer on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:58:37] PROBLEM - puppet last run on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:07] PROBLEM - SSH on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 22: Connection refused
[12:59:16] PROBLEM - Freshness of zerofetch successful run file on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:17] PROBLEM - MD RAID on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:27] PROBLEM - Disk space on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:27] PROBLEM - DPKG on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:27] PROBLEM - Freshness of OCSP Stapling files on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:36] PROBLEM - Confd vcl based reload on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:47] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:00:57] PROBLEM - Host cp1076 is DOWN: PING CRITICAL - Packet loss = 100%
[13:03:26] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:36] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:47] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:47] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:48] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:48] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:49] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:49] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:03:56] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:56] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:56] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:57] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:06] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:06] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:07] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:16] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:16] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:16] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:17] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:17] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:26] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:26] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:26] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:27] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:15:46] RECOVERY - Host cp1076 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[13:16:07] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:17:57] PROBLEM - DPKG on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:17:57] PROBLEM - Check systemd state on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:17:57] PROBLEM - Freshness of OCSP Stapling files on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:06] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3126: Connection refused
[13:18:06] PROBLEM - Confd vcl based reload on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:07] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 80: Connection refused
[13:18:07] PROBLEM - Varnish HTCP daemon on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:16] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:17] PROBLEM - dhclient process on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:17] PROBLEM - HTTPS Unified RSA on cp1076 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:18:17] PROBLEM - HTTPS Unified ECDSA on cp1076 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:18:26] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3125: Connection refused
[13:18:27] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3121: Connection refused
[13:18:36] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3122: Connection refused
[13:18:36] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3124: Connection refused
[13:18:36] PROBLEM - Webrequests Varnishkafka log producer on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:37] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3123: Connection refused
[13:18:37] PROBLEM - configured eth on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:37] PROBLEM - traffic-pool service on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:38] PROBLEM - confd service on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:38] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:46] PROBLEM - Freshness of zerofetch successful run file on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:56] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3127: Connection refused
[13:18:57] PROBLEM - Varnish traffic logger - varnishstatsd on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:57] PROBLEM - Varnish traffic logger - varnishmedia on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:57] PROBLEM - MD RAID on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:57] PROBLEM - Disk space on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:19:06] PROBLEM - Varnish traffic logger - varnishreqstats on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:19:57] sorry for the spam!
[13:21:57] PROBLEM - Host cp1076 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:27] RECOVERY - Varnish HTCP daemon on cp1076 is OK: PROCS OK: 1 process with UID = 116 (vhtcpd), args vhtcpd [13:22:36] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1076 is OK: No errors detected [13:22:36] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 36 ESP OK [13:22:36] RECOVERY - HTTPS Unified ECDSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 338130 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 46 days) [13:22:37] RECOVERY - HTTPS Unified RSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 338131 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 46 days) [13:22:37] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 36 ESP OK [13:22:37] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 36 ESP OK [13:22:37] RECOVERY - Host cp1076 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [13:22:38] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [13:22:38] RECOVERY - dhclient process on cp1076 is OK: PROCS OK: 0 processes with command name dhclient [13:22:39] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [13:22:39] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [13:22:40] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [13:22:40] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [13:22:47] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 217 bytes in 0.000 second response time [13:22:56] RECOVERY - Webrequests Varnishkafka log producer on cp1076 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [13:22:56] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 
36 ESP OK [13:22:56] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 36 ESP OK [13:22:57] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [13:22:57] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [13:22:57] RECOVERY - configured eth on cp1076 is OK: OK - interfaces up [13:22:58] RECOVERY - traffic-pool service on cp1076 is OK: OK - traffic-pool is active [13:22:58] RECOVERY - confd service on cp1076 is OK: OK - confd is active [13:22:59] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 36 ESP OK [13:22:59] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [13:23:00] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1076 is OK: No errors detected [13:23:00] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 36 ESP OK [13:23:06] RECOVERY - SSH on cp1076 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [13:23:06] RECOVERY - Freshness of zerofetch successful run file on cp1076 is OK: OK [13:23:06] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 36 ESP OK [13:23:07] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 36 ESP OK [13:23:07] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 36 ESP OK [13:23:16] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 36 ESP OK [13:23:17] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 36 ESP OK [13:23:17] RECOVERY - Varnish traffic logger - varnishstatsd on cp1076 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishstatsd, UID = 0 (root) [13:23:17] RECOVERY - Varnish traffic logger - varnishmedia on cp1076 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [13:23:17] RECOVERY - MD RAID on cp1076 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [13:23:17] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 36 ESP OK [13:23:17] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 36 ESP OK [13:23:18] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 36 ESP OK [13:23:18] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 36 
ESP OK [13:23:19] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 36 ESP OK [13:23:19] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 36 ESP OK [13:23:20] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 36 ESP OK [13:23:20] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 36 ESP OK [13:23:21] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 36 ESP OK [13:24:16] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.001 second response time [13:24:17] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.004 second response time [13:24:27] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.004 second response time [13:24:37] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.001 second response time [13:24:37] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [13:24:37] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [13:24:46] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.000 second response time [13:24:57] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.001 second response time [13:30:47] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10BBlack) p:05Triage>03Normal [13:43:26] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10BBlack) [13:43:40] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10BBlack) Note to future self on a weekday: we should probably dig 
further via the nvme-cli commands, as there's lots of queryable hardware errorlog/state/status that might give more insight. [13:46:50] !log authdns1001: update gdnsd package to 2.99.9161-beta-1+wmf1 [13:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:03] err whoops wrong paste [13:47:19] !log authdns1001: update gdnsd package to 2.99.9930-beta-1+wmf1 (correction to last msg) [13:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:07] !log multatuli: update gdnsd package to 2.99.9930-beta-1+wmf1 [13:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:37] RECOVERY - puppet last run on cp1076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:53:48] (03PS9) 10Paladox: Gerrit: Setup avatars url in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) [13:53:55] (03PS10) 10Paladox: Gerrit: Setup avatars url in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) [13:55:17] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:20:27] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:39:27] RECOVERY - Confd vcl based reload on cp1076 is OK: reload-vcl successfully ran 0h, 0 minutes ago. [15:22:47] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:23:36] PROBLEM - puppet last run on puppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:29:16] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:37:01] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Krishna_Chaitanya_Velaga) Hi, yes, it should. Sorry, new to the working of Phabricator, my bad. I'll make a note of that. [15:39:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:41:27] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:41:56] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:54:06] RECOVERY - puppet last run on puppetmaster1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:06:46] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:07] PROBLEM - High lag on wdqs2003 is CRITICAL: 3974 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:13:17] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 1182 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:18:36] PROBLEM - High lag on wdqs2003 is CRITICAL: 4375 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:25:07] PROBLEM - High lag on wdqs2003 is CRITICAL: 4724 ge 3600 
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:29:26] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:29:26] PROBLEM - High lag on wdqs2003 is CRITICAL: 4938 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:33:37] PROBLEM - High lag on wdqs2003 is CRITICAL: 5166 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:42:17] PROBLEM - High lag on wdqs2003 is CRITICAL: 5593 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:48:47] PROBLEM - High lag on wdqs2003 is CRITICAL: 5947 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:54:37] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:54:37] ACKNOWLEDGEMENT - High lag on wdqs2003 is CRITICAL: 6226 ge 3600 Mathew.onipe Looking into this. probably: T206123 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:07:27] !log restarting wdqs-blazegraph on wdqs2003 [17:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:17] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:55] 10Operations, 10Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (10Krenair) So now we just pin the certspotter package to `release a=stretch-backports`? [17:47:26] 10Operations, 10Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (10Krenair) Actually it looks like it wasn't in stretch, so stretch-backports has highest priority anyway. So the host just needs package updates..?
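For reference, the pin discussed in the certspotter task above would look roughly like the following apt preferences fragment. This is a sketch, not the actual change: the file path and the priority value are assumptions, and per Krenair's second comment a pin may not even be needed if the package only exists in stretch-backports.

```
# /etc/apt/preferences.d/certspotter  (hypothetical path)
Package: certspotter
Pin: release a=stretch-backports
Pin-Priority: 1001
```

A priority above 1000 forces the backports version even over an installed newer version; a value in the 500-990 range would merely prefer it over the default backports priority of 100.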
[17:51:36] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:51:37] PROBLEM - puppet last run on analytics1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:55:07] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:09:59] !log restart Yarn Resource Manager on an-master1002 to force an-master1001 to take the active role back (failed over due to a zk conn issue) [18:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:46] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:21:57] RECOVERY - puppet last run on analytics1062 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:25:26] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:43:06] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:44:56] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:52:06] PROBLEM - puppet last run on archiva1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:17] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:47] PROBLEM - puppet last run on dns1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:10:07] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:10:27] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 906 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:16:08] !log depooling wdqs2003 [19:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:56] PROBLEM - High lag on wdqs2003 is CRITICAL: 1.258e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:17:17] RECOVERY - puppet last run on archiva1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:21:51] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805 (10Ato_01) I have the same problem. user Ato_01 [19:23:56] 10Operations, 10Cloud-Services: Can't login wikitech - https://phabricator.wikimedia.org/T144805 (10Ato_01) 05Resolved>03Open [19:26:36] RECOVERY - puppet last run on cloudservices1004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:27:26] 10Operations, 10Cloud-Services: Can't login wikitech - https://phabricator.wikimedia.org/T144805 (10Ato_01) 05Open>03Resolved [19:27:36] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 132 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:32:16] RECOVERY - puppet last run on dns1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:57] PROBLEM - High lag on wdqs2003 is CRITICAL: 1.22e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:34:57] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:26] PROBLEM - puppet last run on phab1001 is CRITICAL:
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:43:56] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:46:07] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:59:47] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 49 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:00:57] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:04:07] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.703 second response time [20:07:26] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:47] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:14:47] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:18:56] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:31:26] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:32:48] PROBLEM - puppet last run on cloudvirt1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:37:57] RECOVERY - puppet last run on cloudvirt1022 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:40:06] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:41:47] PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:43:56] <_joe_> !log restarting apache2 on puppetmaster1001 [20:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:26] PROBLEM - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf],File[/etc/sysctl.d] [20:48:47] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): File[/etc/swift/account.builder],File[/etc/swift/account.ring.gz],File[/etc/swift/container.builder],File[/etc/swift/container.ring.gz] [20:49:17] PROBLEM - puppet last run on ms-be2034 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 4 minutes ago with 5 failures. Failed resources (up to 3 shown): File[/etc/swift/container.ring.gz],File[/etc/swift/object.builder],File[/etc/swift/object.ring.gz],File[/etc/swift/object-1.builder] [20:49:46] PROBLEM - puppet last run on dns4001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/disable-puppet],File[/usr/local/sbin/enable-puppet] [20:50:06] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/GeoIP] [20:50:56] <_joe_> ok this is due to my restart [20:51:18] <_joe_> things *should* be more stable if I'm right about the cause [21:02:46] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:38] <_joe_> heh apparently not [21:12:07] RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:13:51] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Volans) p:05Normal>03High [21:14:36] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [21:15:06] RECOVERY - puppet last run on dns4001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:15:26] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:18:37] RECOVERY - puppet last run on ms-be1039 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:19:07] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:19:37] RECOVERY - puppet last run on ms-be2034 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:20:24] !log repooling wdqs2003: caught up on updater lag [21:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:01] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Volans) Today we got more widespread failures, and @Joe and I had a look at it, we also tried to restart apache2 on puppetmaster1001 without much success, although the issue mostly recovered after but s... [21:27:07] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:27:35] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Volans) [21:28:06] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:31:48] (03CR) 10Andrew Bogott: "To provide a bit of context, here is the current itch that I am trying to scratch:" [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [21:47:16] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:57:36] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:08:41] 10Operations, 10Cloud-Services: User Shizhao can't login to wikitech - https://phabricator.wikimedia.org/T144805 (10Aklapper) [22:15:16] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:17:36] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:42:06] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.466 second response time [22:45:26] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:27] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:57:37] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:00:26] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 8.126 second response time [23:05:07] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:06:56] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:06] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.667 second response time [23:11:26] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:13:27] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:14:36] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:17:47] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:17:47] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:19:16] PROBLEM - puppet last run on netmon1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:28:06] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:34:37] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:35:36] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:42:56] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.877 second response time [23:43:06] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:44:37] RECOVERY - puppet last run on netmon1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:45:26] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:46:16] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:47:08] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:47:17] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:17] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:48:18] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:51:06] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:59:16] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 8.691 second response time
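A note on reading the threshold alerts throughout this log, such as `RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 906`: the check compares the sampled value against a critical and a warning threshold and reports the comparison inline. A minimal sketch of that logic for the wdqs lag check, assuming the input is lag in seconds (`check_lag` is a hypothetical name, not the actual Icinga plugin):

```shell
# Hedged sketch of the "High lag on wdqs2003" threshold logic:
# CRITICAL when lag >= 3600 s, WARNING when lag >= 1200 s, else OK.
check_lag() {
    lag=$1
    if [ "$lag" -ge 3600 ]; then
        echo "CRITICAL: $lag ge 3600"
    elif [ "$lag" -ge 1200 ]; then
        echo "WARNING: (C)3600 ge (W)1200 ge $lag"
    else
        echo "OK: (C)3600 ge (W)1200 ge $lag"
    fi
}

check_lag 906    # the 19:10:27 recovery: below both thresholds
```

Note that OK and WARNING lines print both thresholds while CRITICAL prints only the crossed one, matching the bot output above.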