[00:16:06] Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (Dzahn) Open→Resolved Back then i wanted to talk more about it but now i am fine with that. As long as one is aware of it. Next time we switch to codfw i might get back to it to adju...
[00:49:27] (CR) Dzahn: [C: +1] "no difference on contint* https://puppet-compiler.wmflabs.org/compiler1001/81/" [puppet] - https://gerrit.wikimedia.org/r/481201 (https://phabricator.wikimedia.org/T209361) (owner: Hashar)
[00:52:22] (PS2) Alex Monk: toolforge: add the new cloud region to all_networks [puppet] - https://gerrit.wikimedia.org/r/481215 (https://phabricator.wikimedia.org/T212327) (owner: Bstorm)
[00:53:13] (PS3) Alex Monk: network: Add the new cloud region to all_networks [puppet] - https://gerrit.wikimedia.org/r/481215 (https://phabricator.wikimedia.org/T212327) (owner: Bstorm)
[00:59:52] !log aaron@deploy1001 Started deploy [performance/navtiming@b47e9fc]: (no justification provided)
[00:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:57] !log aaron@deploy1001 Finished deploy [performance/navtiming@b47e9fc]: (no justification provided) (duration: 00m 05s)
[00:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:45] !log Deployed b47e9fcfece99 to navtiming
[01:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:18:44] Operations, TechCom-RFC, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Krinkle) >>! In T212189#4840406, @daniel wrote: > So, overall, it seems like the solution proposed by the Wikidata team is the only one vi...
[01:18:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[01:20:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:35:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:05:43] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1714 MB (3% inode=91%)
[02:10:33] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1712 MB (3% inode=91%)
[02:20:21] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1650 MB (3% inode=91%)
[02:41:01] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1755 MB (3% inode=92%)
[02:50:43] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1707 MB (3% inode=92%)
[03:00:29] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1687 MB (3% inode=92%)
[03:05:19] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1525 MB (3% inode=91%)
[03:10:11] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1455 MB (3% inode=91%)
[03:15:05] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1753 MB (3% inode=93%)
[03:21:13] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1447 MB (3% inode=91%)
[03:35:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 864.48 seconds
[03:57:01] Anybody^
[04:02:30] Just one slave lagging is probably nothing.
[04:44:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 289.07 seconds
[05:47:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[05:48:23] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:48:51] (CR) Krinkle: [C: +1] Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957 (owner: Tim Starling)
[05:49:43] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[05:52:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[05:53:04] (CR) Krinkle: Excimer and Tideways support (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/478137 (owner: Tim Starling)
[05:53:05] (CR) Krinkle: [C: +1] Excimer and Tideways support [mediawiki-config] - https://gerrit.wikimedia.org/r/478137 (owner: Tim Starling)
[05:53:11] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[05:53:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:53:23] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[05:54:25] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[05:55:37] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[05:55:39] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[05:58:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[05:59:17] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[06:01:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[06:04:19] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[06:09:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[06:10:21] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[06:11:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[06:12:43] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[06:12:45] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[06:19:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[06:26:09] RECOVERY - Disk space on an-coord1001 is OK: DISK OK
[06:56:56] (CR) ArielGlenn: "Thanks for looking at this. I don't understand why nfs port values are being moved out to separate files and duplicated, when they are com" [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[09:14:34] Operations, ops-eqiad: frdb1001 RAID controller battery failure - https://phabricator.wikimedia.org/T212556 (Jgreen) p:Triage→Unbreak!
[09:45:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:45:19] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[09:47:43] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[09:48:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:53:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:55:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:11:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:14:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:20:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:23:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:10:50] Operations, TechCom-RFC, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (daniel) @Krinkle said: > Rather, it starts out on the assumption that we're going to have UI code in production (based on Vue.js) written...
[11:18:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:19:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:45:15] (PS1) Urbanecm: New throttle rule for students writing Wikipedia program [mediawiki-config] - https://gerrit.wikimedia.org/r/481240 (https://phabricator.wikimedia.org/T212226)
[13:34:41] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 171.03, 103.26, 51.99
[13:35:55] PROBLEM - MD RAID on ms-be2018 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[13:36:00] Operations, ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T212560 (ops-monitoring-bot)
[13:38:57] PROBLEM - Disk space on ms-be2018 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error
[13:38:57] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:39:33] RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 25.88, 67.98, 52.89
[13:46:47] PROBLEM - swift-container-updater on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[13:52:05] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:13:01] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:13:51] PROBLEM - DPKG on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:13:53] PROBLEM - Device not healthy -SMART- on ms-be2018 is CRITICAL: cluster=swift device={cciss,10,cciss,11,cciss,12,cciss,13,cciss,5,cciss,6,cciss,7,cciss,8,cciss,9} instance=ms-be2018:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw%2520prometheus%252Fops
[14:13:59] PROBLEM - MD RAID on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:14:51] PROBLEM - swift-object-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:14:55] PROBLEM - Disk space on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:15:01] PROBLEM - swift-object-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:15:37] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:37] PROBLEM - swift-account-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:39] PROBLEM - swift-container-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:45] PROBLEM - swift-object-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:45] PROBLEM - swift-container-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:47] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:16:49] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:09] PROBLEM - swift-account-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:09] PROBLEM - Check size of conntrack table on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:11] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:17] PROBLEM - very high load average likely xfs on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:19] PROBLEM - dhclient process on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:23] PROBLEM - Disk space on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:35] PROBLEM - DPKG on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:17:43] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:18:11] PROBLEM - configured eth on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:18:43] PROBLEM - swift-object-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:15] PROBLEM - swift-account-reaper on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:39] PROBLEM - swift-container-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:39] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:41] PROBLEM - swift-object-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:19:47] PROBLEM - dhclient process on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:20:29] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:20:57] RECOVERY - very high load average likely xfs on ms-be2020 is OK: OK - load average: 41.58, 34.91, 28.86
[14:20:57] RECOVERY - swift-object-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:20:57] RECOVERY - dhclient process on ms-be2020 is OK: PROCS OK: 0 processes with command name dhclient
[14:20:59] RECOVERY - Disk space on ms-be2020 is OK: DISK OK
[14:21:03] RECOVERY - swift-object-auditor on ms-be2020 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[14:21:09] RECOVERY - DPKG on ms-be2020 is OK: All packages OK
[14:21:17] RECOVERY - MD RAID on ms-be2020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:21:27] RECOVERY - swift-account-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[14:21:27] RECOVERY - swift-container-updater on ms-be2020 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[14:21:35] RECOVERY - swift-object-updater on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[14:21:35] RECOVERY - swift-container-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:21:35] RECOVERY - swift-account-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[14:21:35] RECOVERY - swift-account-reaper on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[14:21:39] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2020 is OK: OK ferm input default policy is set
[14:21:39] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational
[14:21:45] RECOVERY - configured eth on ms-be2020 is OK: OK - interfaces up
[14:21:59] RECOVERY - Check size of conntrack table on ms-be2020 is OK: OK: nf_conntrack is 5 % full
[14:21:59] RECOVERY - swift-account-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[14:22:01] RECOVERY - swift-container-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[14:22:01] RECOVERY - swift-container-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:22:01] RECOVERY - swift-object-server on ms-be2020 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:23:51] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational
[14:25:09] PROBLEM - MD RAID on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:17] PROBLEM - swift-container-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:17] PROBLEM - swift-account-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:21] godog ^^
[14:25:25] PROBLEM - swift-container-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:25] PROBLEM - swift-object-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:27] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:51] PROBLEM - swift-account-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:51] PROBLEM - Check size of conntrack table on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:53] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:25:55] PROBLEM - very high load average likely xfs on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:01] PROBLEM - swift-object-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:09] PROBLEM - swift-object-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:13] PROBLEM - DPKG on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:41] PROBLEM - swift-account-reaper on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:26:51] PROBLEM - configured eth on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:27:07] PROBLEM - swift-container-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:27:09] PROBLEM - swift-object-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:27:45] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.126: Connection reset by peer
[14:27:57] RECOVERY - swift-account-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[14:27:57] RECOVERY - configured eth on ms-be2020 is OK: OK - interfaces up
[14:28:03] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures
[14:28:13] RECOVERY - Check size of conntrack table on ms-be2020 is OK: OK: nf_conntrack is 6 % full
[14:28:13] RECOVERY - swift-account-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[14:28:15] RECOVERY - swift-container-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[14:28:15] RECOVERY - swift-object-server on ms-be2020 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:28:15] RECOVERY - swift-container-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:28:17] RECOVERY - very high load average likely xfs on ms-be2020 is OK: OK - load average: 39.15, 37.98, 32.19
[14:28:23] RECOVERY - swift-object-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:28:31] RECOVERY - swift-object-auditor on ms-be2020 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[14:28:37] RECOVERY - DPKG on ms-be2020 is OK: All packages OK
[14:28:45] RECOVERY - MD RAID on ms-be2020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:28:53] RECOVERY - swift-container-updater on ms-be2020 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[14:28:53] RECOVERY - swift-account-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[14:29:01] RECOVERY - swift-container-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:29:01] RECOVERY - swift-object-updater on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[14:29:01] RECOVERY - swift-account-reaper on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[14:57:51] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be2020 is OK: OK: synced at Sat 2018-12-22 14:57:49 UTC.
[15:58:44] !log reboot ms-be2018, stuck on sd 0:1:0:1: rejecting I/O to offline device
[15:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:15] paladox: thanks ^
[16:01:37] RECOVERY - Disk space on ms-be2018 is OK: DISK OK
[16:02:13] RECOVERY - swift-container-updater on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[16:02:33] RECOVERY - MD RAID on ms-be2018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[16:02:47] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational
[16:09:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:14:43] RECOVERY - Device not healthy -SMART- on ms-be2018 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw%2520prometheus%252Fops
[16:15:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:03:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:05:51] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1717 MB (3% inode=91%)
[17:09:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:10:43] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1676 MB (3% inode=91%)
[17:11:28] (PS1) Urbanecm: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/481244 (https://phabricator.wikimedia.org/T123188)
[17:20:27] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1668 MB (3% inode=91%)
[17:30:11] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1606 MB (3% inode=91%)
[17:41:05] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1767 MB (3% inode=92%)
[17:46:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:50:47] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1706 MB (3% inode=92%)
[17:55:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:00:29] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1693 MB (3% inode=92%)
[18:05:19] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1485 MB (3% inode=91%)
[18:10:11] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1484 MB (3% inode=91%)
[18:16:15] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1645 MB (3% inode=92%)
[18:21:09] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1464 MB (3% inode=91%)
[18:30:53] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1619 MB (3% inode=92%)
[18:32:08] checking --^
[18:34:31] RECOVERY - Disk space on an-coord1001 is OK: DISK OK
[18:45:29] !log manually clean up of old log files on an-coord1001 (disk space issues)
[18:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:50] (PS1) Elukey: profile::hadoop::hdfs_balancer: fix logrotate path [puppet] - https://gerrit.wikimedia.org/r/481248
[18:46:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:46:56] (CR) Elukey: [C: +2] profile::hadoop::hdfs_balancer: fix logrotate path [puppet] - https://gerrit.wikimedia.org/r/481248 (owner: Elukey)
[18:48:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:49:11] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: /mnt/hdfs 0 MB (0% inode=50%)
[18:50:23] RECOVERY - Disk space on an-coord1001 is OK: DISK OK
[18:51:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:55:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:01:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:07:37] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[20:35:01] PROBLEM - Restbase root url on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 7231: Connection refused
[20:41:07] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 16164 bytes in 0.012 second response time
[21:53:47] (PS1) Paladox: wmlib: Move def function in ini and php_ini within newfunction [puppet] - https://gerrit.wikimedia.org/r/481254
[21:54:22] (PS2) Paladox: wmlib: Move def function in ini and php_ini within newfunction [puppet] - https://gerrit.wikimedia.org/r/481254
[21:55:23] (CR) jerkins-bot: [V: -1] wmlib: Move def function in ini and php_ini within newfunction [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[21:55:30] (PS3) Paladox: wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254
[21:55:36] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[21:56:28] (CR) jerkins-bot: [V: -1] wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[21:56:35] (CR) jerkins-bot: [V: -1] wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[21:58:30] (PS4) Paladox: wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254
[21:58:37] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[22:00:46] (PS5) Paladox: wmlib: Fix support for puppet6 in php_ini.rb and ini.rb [puppet] - https://gerrit.wikimedia.org/r/481254
[22:00:51] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[22:02:05] (CR) Paladox: "puppet catalog https://puppet-compiler.wmflabs.org/compiler1001/86/" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)
[22:02:30] (CR) Paladox: "error i got without this:" [puppet] - https://gerrit.wikimedia.org/r/481254 (owner: Paladox)