[01:31:43] PROBLEM - Device not healthy -SMART- on ms-be2039 is CRITICAL: cluster=swift device=cciss,13 instance=ms-be2039:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2039var-datasource=codfw%2520prometheus%252Fops
[02:01:57] RECOVERY - Device not healthy -SMART- on ms-be2039 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2039var-datasource=codfw%2520prometheus%252Fops
[02:07:32] !log bawolff@deploy1001 Synchronized private/PrivateSettings.php: fine-tune antispam measure T212667 (duration: 00m 47s)
[02:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:36] T212667: Create mitigations for account creation spam attack [public task] - https://phabricator.wikimedia.org/T212667
[02:57:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5
[03:05:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5
[03:32:45] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 936.18 seconds
[03:33:49] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.223 second response time
[04:03:24] (PS2) BryanDavis: ssh-key-ldap-lookup: handle missing users [puppet] - https://gerrit.wikimedia.org/r/481343 (https://phabricator.wikimedia.org/T201331)
[04:13:11] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:14:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 268.03 seconds
[04:19:09] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[04:32:43] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:35:01] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[04:38:51] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:41:09] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[04:44:59] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:46:05] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[04:49:55] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:52:11] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.001 second response time
[05:02:11] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:04:29] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.012 second response time
[05:08:19] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:18:01] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[05:26:47] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:40:09] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.313 second response time
[05:43:57] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:51:11] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.573 second response time
[05:58:43] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:00:59] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.019 second response time
[06:04:49] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:18:17] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.538 second response time
[06:28:07] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:55] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.007 second response time
[06:29:25] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:30:27] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[06:32:05] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:34:15] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:45] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:38:33] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.545 second response time
[06:49:59] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[06:57:29] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:58:07] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:04:45] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.995 second response time
[07:08:31] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:10:59] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.767 second response time
[07:14:41] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:16:59] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[07:17:12] !log restart pdfrender on scb1002 (alarms flapping)
[07:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:45] Operations: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey)
[12:36:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1
[12:37:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:38:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:38:23] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1
[12:41:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:43:13] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1
[12:43:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1
[12:44:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:45:23] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:45:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:48:05] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1
[12:49:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1
[12:50:33] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1
[12:50:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1
[14:41:11] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:12:27] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[18:39:03] (PS6) MacFan4000: Set wgNoticeProjects for wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694)
[18:59:51] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:47:13] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational
[23:08:24] (CR) Andrew Bogott: [C: +2] ssh-key-ldap-lookup: handle missing users [puppet] - https://gerrit.wikimedia.org/r/481343 (https://phabricator.wikimedia.org/T201331) (owner: BryanDavis)
[23:42:55] (CR) Platonides: [C: +1] Add SPF record for wikidata.org [dns] - https://gerrit.wikimedia.org/r/477034 (https://phabricator.wikimedia.org/T210134) (owner: Revi)
[23:48:54] (PS3) Platonides: Add SPF record for wikidata.org [dns] - https://gerrit.wikimedia.org/r/477034 (https://phabricator.wikimedia.org/T210134) (owner: Revi)