[00:00:12] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on einsteinium is CRITICAL: 46.8 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:01:03] ^ expected
[00:01:14] "traffic drop" alert is just too heuristic and uninformed
[00:01:43] bblack, ping/page me if needed, I'll keep an eye on my phone, I'm 30min away from my laptop
[00:02:23] XioNoX: ok, thanks!
[00:02:45] in general, our cache hitrate is dropping off a bit with all these fresh caches coming in, which I'm sure is spiking up the reqrates to MW and other backend services a bit
[00:04:22] (03PS5) 10Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485
[00:05:26] cache_upload was ~97% global hitrate before all of this, now bottomed out of the change curve at 92.5% (should only go up from here)
[00:06:18] cache_text is a similar curve, but numbers more like 96.5% dropping to 95%
[00:06:27] (for cacheable traffic)
[00:06:55] so it's not a ton on the text side, shouldn't cause any real problems there
[00:07:22] the upload side is a bit crazy, that's a little more than doubling the miss traffic they see at Swift (and indirectly, thumbor)
[00:07:50] but also should be within the realm of reason, even if it's much more noticeable
[00:09:04] anyways, upload is already back to 93.1% while I was typing all that. It will only slowly get better over time.
[00:10:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:17:32] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:26:25] 94.3% now
[00:26:34] stepping away from the keys!
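(Editor's note: the hit-rate figures above can be sanity-checked directly. A drop from ~97% to 92.5% hitrate means misses rise from ~3% to 7.5% of requests, i.e. 2.5x the origin-bound traffic; the 93.1% recovery point works out to ~2.3x, matching the "little more than doubling" at Swift. A quick illustrative sketch of that arithmetic, not a WMF tool:)

```python
def miss_multiplier(old_hitrate: float, new_hitrate: float) -> float:
    """Factor by which miss (origin-bound) traffic grows when the
    cache hit rate drops, assuming a constant overall request rate."""
    return (1 - new_hitrate) / (1 - old_hitrate)

# cache_upload: ~97% -> 92.5% at the bottom of the curve
print(round(miss_multiplier(0.97, 0.925), 2))  # 2.5x miss traffic to Swift/thumbor
# ...and back at 93.1%: ~2.3x, i.e. "a little more than doubling"
print(round(miss_multiplier(0.97, 0.931), 2))
# cache_text: 96.5% -> 95% is a much gentler increase
print(round(miss_multiplier(0.965, 0.95), 2))
```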
[00:34:12] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:34:31] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on einsteinium is OK: (C)60 le (W)70 le 76.21 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:38:31] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:42:52] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:42:52] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:59:32] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[01:05:23] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:08:41] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:13:21] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:15:24] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:18:27] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10Shahadat) @Jayantanth, I had sent an email via the bn wikivoys email sending system, but you did not reply to my mail with your email address.
[01:20:22] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:39:02] RECOVERY - puppet last run on ganeti1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:39:12] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:39:22] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:49:02] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:50:51] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:04:41] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[02:14:12] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:19:21] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:44:41] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[02:49:51] PROBLEM - puppet last run on wtp1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:57:31] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:20:12] RECOVERY - puppet last run on wtp1025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:27:12] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:27:52] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:30:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 921.83 seconds
[03:32:12] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:45:51] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 202.45 seconds
[03:46:32] PROBLEM - puppet last run on lvs1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:52:41] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:52:41] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:01] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:55:22] PROBLEM - puppet last run on wtp1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:56:51] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Krinkle) The prior conversation at T179212 may be relevant here. The two options discussed so far were 1) Use Gerrit repo with static file server, or 2) Use Phabricator. ##### Ger...
[03:57:32] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[04:05:01] PROBLEM - Device not healthy -SMART- on db1072 is CRITICAL: cluster=mysql device=megaraid,10 instance=db1072:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad%2520prometheus%252Fops
[04:17:01] RECOVERY - puppet last run on lvs1016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:17:51] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:20:41] RECOVERY - puppet last run on wtp1045 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:23:12] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:27:14] 10Operations, 10Wikimedia-Mailing-lists: Request new mail list for Vietnam Wikimedians User Group - https://phabricator.wikimedia.org/T204974 (10minhhuy) Thank you @herron, everything seems perfect :)
[04:45:02] PROBLEM - SSH cp3032.mgmt on cp3032.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:45:01] RECOVERY - SSH cp3032.mgmt on cp3032.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[06:00:47] 10Operations, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Tbayer) >>! In T178802#4641742, @Ottomata wrote: > @HaeB do you still need this? Can we roll this back? Yes, until the end of January it looks like (se...
[06:11:23] (03PS4) 10Urbanecm: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769) (owner: 10MarcoAurelio)
[07:06:12] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:19:11] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:31:52] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:40:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:44:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:49:01] (03PS1) 10Elukey: Apply the same permissions of an1003 to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464940 (https://phabricator.wikimedia.org/T205509)
[07:49:41] (03CR) 10Elukey: [C: 032] Apply the same permissions of an1003 to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464940 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey)
[09:29:07] PROBLEM - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 5 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops
[10:29:12] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Krishna_Chaitanya_Velaga) The old email has been deactivated; please reset the passwords and add kcvelaga@gmail.com as the new admin. Thanks in advance :)
[10:29:22] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Krishna_Chaitanya_Velaga) 05Resolved>03Open
[11:09:07] PROBLEM - High CPU load on API appserver on mw2205 is CRITICAL: CRITICAL - load average: 82.56, 49.97, 26.43
[11:09:37] PROBLEM - HHVM rendering on mw2205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time
[11:09:46] PROBLEM - Apache HTTP on mw2205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[11:09:46] PROBLEM - Nginx local proxy to apache on mw2205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.151 second response time
[11:10:46] RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 75033 bytes in 1.367 second response time
[11:10:47] RECOVERY - Apache HTTP on mw2205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.132 second response time
[11:10:47] RECOVERY - Nginx local proxy to apache on mw2205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.173 second response time
[11:12:26] RECOVERY - High CPU load on API appserver on mw2205 is OK: OK - load average: 14.49, 31.20, 23.33
[11:22:16] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:25:07] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 216 bytes in 0.000 second response time
[11:32:46] (03CR) 10Volans: [C: 04-1] "I think that this is the wrong way to generalize/abstract it." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott)
[11:35:21] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Banyek: Debian package or files managed by puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Volans) >>! In T203674#4646056, @Banyek wrote: > After merging, on one host the pt-kill running in screen should be stopped, then puppet can be...
[12:33:11] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Aklapper) For future reference please file separate tickets for separate requests. This request (mailing list creation) was resolved already in June. The list likely should h...
[12:40:34] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 3 others: Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) Ferm PR: https://github.com...
[12:44:49] (03PS1) 10BBlack: remove FB UA check in our 1h caching for fb_nets [puppet] - 10https://gerrit.wikimedia.org/r/464948
[12:46:29] (03CR) 10BBlack: [C: 032] remove FB UA check in our 1h caching for fb_nets [puppet] - 10https://gerrit.wikimedia.org/r/464948 (owner: 10BBlack)
[12:46:57] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:37] !log depool cp1076, apparently has disk issues
[12:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:04] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Dzahn) I don't think that cloning / committing / pushing is a big hurdle in this case. After all we are talking about Gerrit users, not Phabricator. People who already explicitly co...
[12:50:47] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:56:17] PROBLEM - Check systemd state on cp1076 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:57:00] !log rebooting cp1076
[12:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:16] PROBLEM - Varnish traffic logger - varnishstatsd on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:16] PROBLEM - Varnish traffic logger - varnishmedia on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:17] PROBLEM - Varnish traffic logger - varnishreqstats on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:36] PROBLEM - Varnish HTCP daemon on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:37] PROBLEM - dhclient process on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:57:56] PROBLEM - Webrequests Varnishkafka log producer on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:58:37] PROBLEM - puppet last run on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:07] PROBLEM - SSH on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 22: Connection refused
[12:59:16] PROBLEM - Freshness of zerofetch successful run file on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:17] PROBLEM - MD RAID on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:27] PROBLEM - Disk space on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:27] PROBLEM - DPKG on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:27] PROBLEM - Freshness of OCSP Stapling files on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:36] PROBLEM - Confd vcl based reload on cp1076 is CRITICAL: Return code of 255 is out of bounds
[12:59:47] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:00:57] PROBLEM - Host cp1076 is DOWN: PING CRITICAL - Packet loss = 100%
[13:03:26] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:27] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:36] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:46] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:47] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:47] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:48] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:48] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:49] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:49] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:03:56] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:56] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:56] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:03:57] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:06] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:06] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:07] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:16] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:16] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:16] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:17] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:17] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[13:04:26] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:26] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:26] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:04:27] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1076_v4, cp1076_v6
[13:15:46] RECOVERY - Host cp1076 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[13:16:07] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:17:57] PROBLEM - DPKG on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:17:57] PROBLEM - Check systemd state on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:17:57] PROBLEM - Freshness of OCSP Stapling files on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:06] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3126: Connection refused
[13:18:06] PROBLEM - Confd vcl based reload on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:07] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 80: Connection refused
[13:18:07] PROBLEM - Varnish HTCP daemon on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:16] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:17] PROBLEM - dhclient process on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:17] PROBLEM - HTTPS Unified RSA on cp1076 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:18:17] PROBLEM - HTTPS Unified ECDSA on cp1076 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:18:26] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3125: Connection refused
[13:18:27] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3121: Connection refused
[13:18:36] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3122: Connection refused
[13:18:36] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3124: Connection refused
[13:18:36] PROBLEM - Webrequests Varnishkafka log producer on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:37] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3123: Connection refused
[13:18:37] PROBLEM - configured eth on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:37] PROBLEM - traffic-pool service on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:38] PROBLEM - confd service on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:38] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:46] PROBLEM - Freshness of zerofetch successful run file on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:56] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 3127: Connection refused
[13:18:57] PROBLEM - Varnish traffic logger - varnishstatsd on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:57] PROBLEM - Varnish traffic logger - varnishmedia on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:57] PROBLEM - MD RAID on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:18:57] PROBLEM - Disk space on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:19:06] PROBLEM - Varnish traffic logger - varnishreqstats on cp1076 is CRITICAL: Return code of 255 is out of bounds
[13:19:57] sorry for the spam!
[13:21:57] PROBLEM - Host cp1076 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:27] RECOVERY - Varnish HTCP daemon on cp1076 is OK: PROCS OK: 1 process with UID = 116 (vhtcpd), args vhtcpd [13:22:36] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1076 is OK: No errors detected [13:22:36] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 36 ESP OK [13:22:36] RECOVERY - HTTPS Unified ECDSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 338130 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 46 days) [13:22:37] RECOVERY - HTTPS Unified RSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 338131 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 46 days) [13:22:37] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 36 ESP OK [13:22:37] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 36 ESP OK [13:22:37] RECOVERY - Host cp1076 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [13:22:38] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [13:22:38] RECOVERY - dhclient process on cp1076 is OK: PROCS OK: 0 processes with command name dhclient [13:22:39] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [13:22:39] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [13:22:40] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [13:22:40] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [13:22:47] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 217 bytes in 0.000 second response time [13:22:56] RECOVERY - Webrequests Varnishkafka log producer on cp1076 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [13:22:56] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 
36 ESP OK [13:22:56] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 36 ESP OK [13:22:57] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [13:22:57] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [13:22:57] RECOVERY - configured eth on cp1076 is OK: OK - interfaces up [13:22:58] RECOVERY - traffic-pool service on cp1076 is OK: OK - traffic-pool is active [13:22:58] RECOVERY - confd service on cp1076 is OK: OK - confd is active [13:22:59] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 36 ESP OK [13:22:59] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [13:23:00] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1076 is OK: No errors detected [13:23:00] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 36 ESP OK [13:23:06] RECOVERY - SSH on cp1076 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [13:23:06] RECOVERY - Freshness of zerofetch successful run file on cp1076 is OK: OK [13:23:06] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 36 ESP OK [13:23:07] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 36 ESP OK [13:23:07] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 36 ESP OK [13:23:16] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 36 ESP OK [13:23:17] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 36 ESP OK [13:23:17] RECOVERY - Varnish traffic logger - varnishstatsd on cp1076 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishstatsd, UID = 0 (root) [13:23:17] RECOVERY - Varnish traffic logger - varnishmedia on cp1076 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [13:23:17] RECOVERY - MD RAID on cp1076 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [13:23:17] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 36 ESP OK [13:23:17] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 36 ESP OK [13:23:18] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 36 ESP OK [13:23:18] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 36 
ESP OK [13:23:19] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 36 ESP OK [13:23:19] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 36 ESP OK [13:23:20] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 36 ESP OK [13:23:20] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 36 ESP OK [13:23:21] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 36 ESP OK [13:24:16] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.001 second response time [13:24:17] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.004 second response time [13:24:27] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.004 second response time [13:24:37] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.001 second response time [13:24:37] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [13:24:37] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [13:24:46] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.000 second response time [13:24:57] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp1076 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.001 second response time [13:30:47] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10BBlack) p:05Triage>03Normal [13:43:26] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10BBlack) [13:43:40] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10BBlack) Note to future self on a weekday: we should probably dig 
further via the nvme-cli commands, as there's lots of queryable hardware errorlog/state/status that might give more insight. [13:46:50] !log authdns1001: update gdnsd package to 2.99.9161-beta-1+wmf1 [13:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:03] err whoops wrong paste [13:47:19] !log authdns1001: update gdnsd package to 2.99.9930-beta-1+wmf1 (correction to last msg) [13:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:07] !log multatuli: update gdnsd package to 2.99.9930-beta-1+wmf1 [13:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:37] RECOVERY - puppet last run on cp1076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:53:48] (03PS9) 10Paladox: Gerrit: Setup avatars url in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) [13:53:55] (03PS10) 10Paladox: Gerrit: Setup avatars url in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) [13:55:17] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:20:27] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:39:27] RECOVERY - Confd vcl based reload on cp1076 is OK: reload-vcl successfully ran 0h, 0 minutes ago. [15:22:47] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:23:36] PROBLEM - puppet last run on puppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:29:16] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:37:01] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Krishna_Chaitanya_Velaga) Hi, yes, it should. Sorry, new to the working of Phabricator, my bad. I'll make a note of that. [15:39:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:41:27] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:41:56] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:54:06] RECOVERY - puppet last run on puppetmaster1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:06:46] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:07] PROBLEM - High lag on wdqs2003 is CRITICAL: 3974 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:13:17] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 1182 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:18:36] PROBLEM - High lag on wdqs2003 is CRITICAL: 4375 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:25:07] PROBLEM - High lag on wdqs2003 is CRITICAL: 4724 ge 3600 
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:29:26] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:29:26] PROBLEM - High lag on wdqs2003 is CRITICAL: 4938 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:33:37] PROBLEM - High lag on wdqs2003 is CRITICAL: 5166 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:42:17] PROBLEM - High lag on wdqs2003 is CRITICAL: 5593 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:48:47] PROBLEM - High lag on wdqs2003 is CRITICAL: 5947 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:54:37] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:54:37] ACKNOWLEDGEMENT - High lag on wdqs2003 is CRITICAL: 6226 ge 3600 Mathew.onipe Looking into this. probably: T206123 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:07:27] !log restarting wdqs-blazegraph on wdqs2003 [17:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:17] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:55] 10Operations, 10Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (10Krenair) So now we just pin the certspotter package to `release a=stretch-backports`? [17:47:26] 10Operations, 10Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (10Krenair) Actually it looks like it wasn't in stretch, so stretch-backports has highest priority anyway. So the host just needs package updates..?
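For reference, the pin discussed in the certspotter task above would look roughly like the following apt preferences fragment. This is a sketch, not the actual change: the file path and the priority value are assumptions, and per Krenair's second comment a pin may not even be needed if the package only exists in stretch-backports.

```
# /etc/apt/preferences.d/certspotter  (hypothetical path)
Package: certspotter
Pin: release a=stretch-backports
Pin-Priority: 1001
```

A priority above 1000 forces the backports version even over an installed newer version; a value in the 500-990 range would merely prefer it over the default backports priority of 100.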
[17:51:36] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:51:37] PROBLEM - puppet last run on analytics1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:55:07] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:09:59] !log restart Yarn Resource Manager on an-master1002 to force an-master1001 to take the active role back (failed over due to a zk conn issue) [18:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:46] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:21:57] RECOVERY - puppet last run on analytics1062 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:25:26] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:43:06] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:44:56] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:52:06] PROBLEM - puppet last run on archiva1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:17] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:47] PROBLEM - puppet last run on dns1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:10:07] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:10:27] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 906 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:16:08] !log depooling wdqs2003 [19:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:56] PROBLEM - High lag on wdqs2003 is CRITICAL: 1.258e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:17:17] RECOVERY - puppet last run on archiva1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:21:51] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805 (10Ato_01) I have the same problem. user Ato_01 [19:23:56] 10Operations, 10Cloud-Services: Can't login wikitech - https://phabricator.wikimedia.org/T144805 (10Ato_01) 05Resolved>03Open [19:26:36] RECOVERY - puppet last run on cloudservices1004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:27:26] 10Operations, 10Cloud-Services: Can't login wikitech - https://phabricator.wikimedia.org/T144805 (10Ato_01) 05Open>03Resolved [19:27:36] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 132 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:32:16] RECOVERY - puppet last run on dns1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:57] PROBLEM - High lag on wdqs2003 is CRITICAL: 1.22e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:34:57] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:26] PROBLEM - puppet last run on phab1001 is CRITICAL:
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:43:56] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:46:07] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:59:47] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 49 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:00:57] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:04:07] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.703 second response time [20:07:26] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:47] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:14:47] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:18:56] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:31:26] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:32:48] PROBLEM - puppet last run on cloudvirt1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:37:57] RECOVERY - puppet last run on cloudvirt1022 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:40:06] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:41:47] PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:43:56] <_joe_> !log restarting apache2 on puppetmaster1001 [20:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:26] PROBLEM - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf],File[/etc/sysctl.d] [20:48:47] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): File[/etc/swift/account.builder],File[/etc/swift/account.ring.gz],File[/etc/swift/container.builder],File[/etc/swift/container.ring.gz] [20:49:17] PROBLEM - puppet last run on ms-be2034 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 4 minutes ago with 5 failures. Failed resources (up to 3 shown): File[/etc/swift/container.ring.gz],File[/etc/swift/object.builder],File[/etc/swift/object.ring.gz],File[/etc/swift/object-1.builder] [20:49:46] PROBLEM - puppet last run on dns4001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/disable-puppet],File[/usr/local/sbin/enable-puppet] [20:50:06] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/GeoIP] [20:50:56] <_joe_> ok this is due to my restart [20:51:18] <_joe_> things *should* be more stable if I'm right about the cause [21:02:46] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:38] <_joe_> heh apparently not [21:12:07] RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:13:51] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Volans) p:05Normal>03High [21:14:36] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [21:15:06] RECOVERY - puppet last run on dns4001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:15:26] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:18:37] RECOVERY - puppet last run on ms-be1039 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:19:07] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:19:37] RECOVERY - puppet last run on ms-be2034 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:20:24] !log repooling wdqs2003: caught up on updater lag [21:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:01] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Volans) Today we got more widespread failures, and @Joe and I had a look at it, we also tried to restart apache2 on puppetmaster1001 without much success, although the issue mostly recovered after but s... [21:27:07] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:27:35] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Volans) [21:28:06] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:31:48] (03CR) 10Andrew Bogott: "To provide a bit of context, here is the current itch that I am trying to scratch:" [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [21:47:16] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:57:36] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:08:41] 10Operations, 10Cloud-Services: User Shizhao can't login to wikitech - https://phabricator.wikimedia.org/T144805 (10Aklapper) [22:15:16] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:17:36] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:42:06] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.466 second response time [22:45:26] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:27] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:57:37] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:00:26] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 8.126 second response time [23:05:07] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:06:56] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:06] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.667 second response time [23:11:26] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:13:27] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:14:36] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:17:47] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:17:47] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:19:16] PROBLEM - puppet last run on netmon1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:28:06] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:34:37] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:35:36] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:42:56] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.877 second response time [23:43:06] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:44:37] RECOVERY - puppet last run on netmon1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:45:26] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:46:16] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:47:08] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:47:17] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:17] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:48:18] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:51:06] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:59:16] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 8.691 second response time
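A note on reading the threshold alerts throughout this log, such as `RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 906`: the check compares the sampled value against a critical and a warning threshold and reports the comparison inline. A minimal sketch of that logic for the wdqs lag check, assuming the input is lag in seconds (`check_lag` is a hypothetical name, not the actual Icinga plugin):

```shell
# Hedged sketch of the "High lag on wdqs2003" threshold logic:
# CRITICAL when lag >= 3600 s, WARNING when lag >= 1200 s, else OK.
check_lag() {
    lag=$1
    if [ "$lag" -ge 3600 ]; then
        echo "CRITICAL: $lag ge 3600"
    elif [ "$lag" -ge 1200 ]; then
        echo "WARNING: (C)3600 ge (W)1200 ge $lag"
    else
        echo "OK: (C)3600 ge (W)1200 ge $lag"
    fi
}

check_lag 906    # the 19:10:27 recovery: below both thresholds
```

Note that OK and WARNING lines print both thresholds while CRITICAL prints only the crossed one, matching the bot output above.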