[00:16:06] Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (Dzahn) Open→Resolved Back then i wanted to talk more about it but now i am fine with that. As long as one is aware of it. Next time we switch to codfw i might get back to it to adju...
[00:49:27] (CR) Dzahn: [C: +1] "no difference on contint* https://puppet-compiler.wmflabs.org/compiler1001/81/" [puppet] - https://gerrit.wikimedia.org/r/481201 (https://phabricator.wikimedia.org/T209361) (owner: Hashar)
[00:52:22] (PS2) Alex Monk: toolforge: add the new cloud region to all_networks [puppet] - https://gerrit.wikimedia.org/r/481215 (https://phabricator.wikimedia.org/T212327) (owner: Bstorm)
[00:53:13] (PS3) Alex Monk: network: Add the new cloud region to all_networks [puppet] - https://gerrit.wikimedia.org/r/481215 (https://phabricator.wikimedia.org/T212327) (owner: Bstorm)
[00:59:52] !log aaron@deploy1001 Started deploy [performance/navtiming@b47e9fc]: (no justification provided)
[00:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:57] !log aaron@deploy1001 Finished deploy [performance/navtiming@b47e9fc]: (no justification provided) (duration: 00m 05s)
[00:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:45] !log Deployed b47e9fcfece99 to navtiming
[01:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:18:44] Operations, TechCom-RFC, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Krinkle) >>! In T212189#4840406, @daniel wrote: > So, overall, it seems like the solution proposed by the Wikidata team is the only one vi...
[01:18:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[01:20:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:35:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:05:43] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1714 MB (3% inode=91%)
[02:10:33] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1712 MB (3% inode=91%)
[02:20:21] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1650 MB (3% inode=91%)
[02:41:01] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1755 MB (3% inode=92%)
[02:50:43] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1707 MB (3% inode=92%)
[03:00:29] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1687 MB (3% inode=92%)
[03:05:19] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1525 MB (3% inode=91%)
[03:10:11] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1455 MB (3% inode=91%)
[03:15:05] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1753 MB (3% inode=93%)
[03:21:13] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1447 MB (3% inode=91%)
[03:35:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 864.48 seconds
[03:57:01] Anybody^
[04:02:30] Just one slave lagging is probably nothing.
[04:44:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 289.07 seconds
[05:47:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[05:48:23] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:48:51] (CR) Krinkle: [C: +1] Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957 (owner: Tim Starling)
[05:49:43] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[05:52:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[05:53:04] (CR) Krinkle: Excimer and Tideways support (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/478137 (owner: Tim Starling)
[05:53:05] (CR) Krinkle: [C: +1] Excimer and Tideways support [mediawiki-config] - https://gerrit.wikimedia.org/r/478137 (owner: Tim Starling)
[05:53:11] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[05:53:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:53:23] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[05:54:25] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[05:55:37] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[05:55:39] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[05:58:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[05:59:17] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[06:01:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[06:04:19] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[06:09:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[06:10:21] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[06:11:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[06:12:43] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[06:12:45] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[06:19:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[06:26:09] RECOVERY - Disk space on an-coord1001 is OK: DISK OK
[06:56:56] (CR) ArielGlenn: "Thanks for looking at this. I don't understand why nfs port values are being moved out to separate files and duplicated, when they are com" [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[09:14:34] Operations, ops-eqiad: frdb1001 RAID controller battery failure - https://phabricator.wikimedia.org/T212556 (Jgreen) p:Triage→Unbreak!
[09:45:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:45:19] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[09:47:43] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[09:48:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:53:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:55:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:11:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:14:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:20:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:23:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:10:50] Operations, TechCom-RFC, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (daniel) @Krinkle said: > Rather, it starts out on the assumption that we're going to have UI code in production (based on Vue.js) written...
[11:18:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:19:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:45:15] (PS1) Urbanecm: New throttle rule for students writing Wikipedia program [mediawiki-config] - https://gerrit.wikimedia.org/r/481240 (https://phabricator.wikimedia.org/T212226)
[13:34:41] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 171.03, 103.26, 51.99
[13:35:55] PROBLEM - MD RAID on ms-be2018 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[13:36:00] Operations, ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T212560 (ops-monitoring-bot)
[13:38:57] PROBLEM - Disk space on ms-be2018 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error
[13:38:57] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:39:33] RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 25.88, 67.98, 52.89
[13:46:47] PROBLEM - swift-container-updater on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[13:52:05] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:13:01] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:13:51] PROBLEM - DPKG on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:13:53] PROBLEM - Device not healthy -SMART- on ms-be2018 is CRITICAL: cluster=swift device={cciss,10,cciss,11,cciss,12,cciss,13,cciss,5,cciss,6,cciss,7,cciss,8,cciss,9} instance=ms-be2018:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw%2520prometheus%252Fops
[14:13:59] PROBLEM - MD RAID on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:14:51] PROBLEM - swift-object-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:14:55] PROBLEM - Disk space on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:15:01] PROBLEM - swift-object-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:15:37] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:37] PROBLEM - swift-account-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:39] PROBLEM - swift-container-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:45] PROBLEM - swift-object-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:45] PROBLEM - swift-container-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:47] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:49] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:09] PROBLEM - swift-account-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:09] PROBLEM - Check size of conntrack table on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:11] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:17] PROBLEM - very high load average likely xfs on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:19] PROBLEM - dhclient process on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:23] PROBLEM - Disk space on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:35] PROBLEM - DPKG on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:43] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:18:11] PROBLEM - configured eth on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:18:43] PROBLEM - swift-object-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:15] PROBLEM - swift-account-reaper on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:39] PROBLEM - swift-container-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:39] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:41] PROBLEM - swift-object-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:47] PROBLEM - dhclient process on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:20:29] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:20:57] RECOVERY - very high load average likely xfs on ms-be2020 is OK: OK - load average: 41.58, 34.91, 28.86
[14:20:57] RECOVERY - swift-object-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:20:57] RECOVERY - dhclient process on ms-be2020 is OK: PROCS OK: 0 processes with command name dhclient
[14:20:59] RECOVERY - Disk space on ms-be2020 is OK: DISK OK
[14:21:03] RECOVERY - swift-object-auditor on ms-be2020 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[14:21:09] RECOVERY - DPKG on ms-be2020 is OK: All packages OK
[14:21:17] RECOVERY - MD RAID on ms-be2020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:21:27] RECOVERY - swift-account-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[14:21:27] RECOVERY - swift-container-updater on ms-be2020 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[14:21:35] RECOVERY - swift-object-updater on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[14:21:35] RECOVERY - swift-container-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:21:35] RECOVERY - swift-account-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[14:21:35] RECOVERY - swift-account-reaper on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[14:21:39] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2020 is OK: OK ferm input default policy is set
[14:21:39] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational
[14:21:45] RECOVERY - configured eth on ms-be2020 is OK: OK - interfaces up
[14:21:59] RECOVERY - Check size of conntrack table on ms-be2020 is OK: OK: nf_conntrack is 5 % full
[14:21:59] RECOVERY - swift-account-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[14:22:01] RECOVERY - swift-container-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[14:22:01] RECOVERY - swift-container-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:22:01] RECOVERY - swift-object-server on ms-be2020 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:23:51] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational
[14:25:09] PROBLEM - MD RAID on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:17] PROBLEM - swift-container-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:17] PROBLEM - swift-account-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:21] godog ^^
[14:25:25] PROBLEM - swift-container-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:25] PROBLEM - swift-object-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:27] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:51] PROBLEM - swift-account-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:51] PROBLEM - Check size of conntrack table on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:53] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:55] PROBLEM - very high load average likely xfs on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:01] PROBLEM - swift-object-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:09] PROBLEM - swift-object-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:13] PROBLEM - DPKG on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:41] PROBLEM - swift-account-reaper on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:51] PROBLEM - configured eth on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:27:07] PROBLEM - swift-container-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:27:09] PROBLEM - swift-object-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:27:45] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:27:57] RECOVERY - swift-account-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[14:27:57] RECOVERY - configured eth on ms-be2020 is OK: OK - interfaces up
[14:28:03] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures
[14:28:13] RECOVERY - Check size of conntrack table on ms-be2020 is OK: OK: nf_conntrack is 6 % full
[14:28:13] RECOVERY - swift-account-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[14:28:15] RECOVERY - swift-container-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[14:28:15] RECOVERY - swift-object-server on ms-be2020 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:28:15] RECOVERY - swift-container-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:28:17] RECOVERY - very high load average likely xfs on ms-be2020 is OK: OK - load average: 39.15, 37.98, 32.19
[14:28:23] RECOVERY - swift-object-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:28:31] RECOVERY - swift-object-auditor on ms-be2020 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[14:28:37] RECOVERY - DPKG on ms-be2020 is OK: All packages OK
[14:28:45] RECOVERY - MD RAID on ms-be2020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:28:53] RECOVERY - swift-container-updater on ms-be2020 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[14:28:53] RECOVERY - swift-account-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[14:29:01] RECOVERY - swift-container-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:29:01] RECOVERY - swift-object-updater on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[14:29:01] RECOVERY - swift-account-reaper on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[14:57:51] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be2020 is OK: OK: synced at Sat 2018-12-22 14:57:49 UTC.
[15:58:44] !log reboot ms-be2018, stuck on sd 0:1:0:1: rejecting I/O to offline device
[15:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:15] paladox: thanks ^
[16:01:37] RECOVERY - Disk space on ms-be2018 is OK: DISK OK
[16:02:13] RECOVERY - swift-container-updater on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[16:02:33] RECOVERY - MD RAID on ms-be2018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[16:02:47] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational
[16:09:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:14:43] RECOVERY - Device not healthy -SMART- on ms-be2018 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw%2520prometheus%252Fops
[16:15:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:03:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:05:51] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1717 MB (3% inode=91%)
[17:09:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:10:43] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1676 MB (3% inode=91%)
[17:11:28] (PS1) Urbanecm: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/481244 (https://phabricator.wikimedia.org/T123188)
[17:20:27] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1668 MB (3% inode=91%)
[17:30:11] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1606 MB (3% inode=91%)
[17:41:05] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1767 MB (3% inode=92%)
[17:46:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:50:47] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1706 MB (3% inode=92%)
[17:55:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:00:29] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1693 MB (3% inode=92%)
[18:05:19] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1485 MB (3% inode=91%)
[18:10:11] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1484 MB (3% inode=91%)
[18:16:15] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1645 MB (3% inode=92%)
[18:21:09] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1464 MB (3% inode=91%)
[18:30:53] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1619 MB (3% inode=92%)
[18:32:08] checking --^
[18:34:31] RECOVERY - Disk space on an-coord1001 is OK: DISK OK
[18:45:29] !log manually clean up of old log files on an-coord1001 (disk space issues)
[18:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:50] (PS1) Elukey: profile::hadoop::hdfs_balancer: fix logrotate path [puppet] - https://gerrit.wikimedia.org/r/481248
[18:46:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:46:56] (CR) Elukey: [C: +2] profile::hadoop::hdfs_balancer: fix logrotate path [puppet] - https://gerrit.wikimedia.org/r/481248 (owner: Elukey)
[18:48:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:49:11] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: /mnt/hdfs 0 MB (0% inode=50%)
[18:50:23] RECOVERY - Disk space on an-coord1001 is OK: DISK OK
[18:51:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:55:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:01:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:07:37] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[20:35:01] PROBLEM - Restbase root url on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 7231: Connection refused
[20:41:07] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 16164 bytes in 0.012 second response time
[21:53:47] (PS1) Paladox: wmlib: Move def function in ini and php_ini within newfunction [puppet] - https://gerrit.wikimedia.org/r/481254
[21:54:22] (PS2) Paladox: wmlib: Move def function in ini and php_ini within newfunction [puppet] - https://gerrit.wikimedia.org/r/481254
[21:55:23] (CR) jerkins-bot: [V: -1] wmlib: Move def function in ini and php_ini within newfunction [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[21:55:30] (PS3) Paladox: wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254
[21:55:36] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[21:56:28] (CR) jerkins-bot: [V: -1] wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[21:56:35] (CR) jerkins-bot: [V: -1] wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[21:58:30] (PS4) Paladox: wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254
[21:58:37] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[22:00:46] (PS5) Paladox: wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254
[22:00:51] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[22:02:05] (CR) Paladox: "puppet catalog https://puppet-compiler.wmflabs.org/compiler1001/86/" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[22:02:30] (CR) Paladox: "error i got without this:" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)