[00:30:43] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 329 MB (3% inode=75%) [00:33:52] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [00:45:53] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [00:47:12] PROBLEM - DPKG on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:47:22] PROBLEM - swift-container-replicator on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:47:22] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:47:22] PROBLEM - MD RAID on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:47:32] PROBLEM - swift-object-auditor on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:47:33] PROBLEM - Check size of conntrack table on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:47:52] PROBLEM - configured eth on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:47:52] PROBLEM - swift-container-updater on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:47:53] PROBLEM - Disk space on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:48:02] PROBLEM - swift-object-server on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:48:03] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[00:48:23] PROBLEM - swift-account-auditor on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:49:02] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:49:22] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:49:52] PROBLEM - swift-container-updater on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:50:03] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:50:33] PROBLEM - Check size of conntrack table on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:51:02] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational [00:51:12] RECOVERY - DPKG on ms-be2016 is OK: All packages OK [00:51:13] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 16.79, 27.20, 27.44 [00:51:13] RECOVERY - swift-container-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [00:51:13] RECOVERY - swift-account-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [00:51:13] RECOVERY - MD RAID on ms-be2016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:51:23] RECOVERY - swift-object-auditor on ms-be2016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [00:51:32] RECOVERY - Check size of conntrack table on ms-be2016 is OK: OK: nf_conntrack is 4 % full [00:51:43] RECOVERY - configured eth on ms-be2016 is OK: OK - interfaces up [00:51:43] RECOVERY - swift-container-updater on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [00:51:52] RECOVERY - Disk space 
on ms-be2016 is OK: DISK OK [00:51:53] RECOVERY - swift-object-server on ms-be2016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:51:53] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2016 is OK: OK ferm input default policy is set [01:15:02] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [01:31:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [01:40:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:09:12] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [03:42:22] PROBLEM - Long running screen/tmux on labstore1007 is CRITICAL: CRIT: Long running SCREEN process. (user: madhuvishy PID: 18051, 1728896s 1728000s). [05:38:12] RECOVERY - Long running screen/tmux on labstore1007 is OK: OK: No SCREEN or tmux processes detected. [06:00:13] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:00:22] PROBLEM - dhclient process on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:00:42] PROBLEM - DPKG on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[06:00:42] PROBLEM - Check size of conntrack table on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:00:52] PROBLEM - MD RAID on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:00:52] PROBLEM - swift-account-auditor on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:00:53] PROBLEM - swift-container-replicator on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:01:02] PROBLEM - swift-container-server on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:01:02] PROBLEM - swift-account-server on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:01:03] PROBLEM - configured eth on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:01:22] PROBLEM - swift-account-reaper on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:02:13] RECOVERY - swift-account-reaper on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:02:13] RECOVERY - dhclient process on ms-be2016 is OK: PROCS OK: 0 processes with command name dhclient [06:02:33] RECOVERY - DPKG on ms-be2016 is OK: All packages OK [06:02:33] RECOVERY - Check size of conntrack table on ms-be2016 is OK: OK: nf_conntrack is 4 % full [06:02:42] RECOVERY - MD RAID on ms-be2016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [06:02:43] RECOVERY - swift-account-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:02:43] RECOVERY - swift-container-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [06:02:52] RECOVERY - swift-container-server on ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:02:52] RECOVERY - swift-account-server on 
ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:03:02] RECOVERY - configured eth on ms-be2016 is OK: OK - interfaces up [06:03:12] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2016 is OK: OK ferm input default policy is set [06:28:42] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:29:43] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:02] PROBLEM - swift-object-replicator on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:12] PROBLEM - swift-account-server on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:32] PROBLEM - Disk space on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:32] PROBLEM - very high load average likely xfs on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:32] PROBLEM - swift-container-auditor on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:33] PROBLEM - swift-account-auditor on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:52] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:53] PROBLEM - swift-account-reaper on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:30:53] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:31:02] PROBLEM - Check size of conntrack table on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[06:31:03] PROBLEM - swift-object-server on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:31:42] PROBLEM - swift-container-updater on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:32:12] PROBLEM - configured eth on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:32:23] PROBLEM - DPKG on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:32:32] PROBLEM - very high load average likely xfs on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:32:42] PROBLEM - swift-account-replicator on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:32:52] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:32:52] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2017 is OK: OK ferm input default policy is set [06:32:52] RECOVERY - swift-account-reaper on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:32:52] RECOVERY - swift-object-replicator on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [06:32:53] RECOVERY - Check size of conntrack table on ms-be2017 is OK: OK: nf_conntrack is 4 % full [06:33:02] RECOVERY - swift-object-server on ms-be2017 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [06:33:03] RECOVERY - configured eth on ms-be2017 is OK: OK - interfaces up [06:33:03] RECOVERY - swift-account-server on ms-be2017 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:33:22] RECOVERY - DPKG on ms-be2017 is OK: All packages OK [06:33:23] RECOVERY - Disk space on ms-be2017 is OK: DISK OK [06:33:23] RECOVERY - very high load average likely xfs on ms-be2017 is OK: OK - load average: 33.31, 35.11, 
28.93 [06:33:32] RECOVERY - swift-account-auditor on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:33:32] RECOVERY - swift-container-auditor on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:33:32] RECOVERY - swift-container-updater on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:33:32] RECOVERY - swift-account-replicator on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:33:42] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational [06:35:42] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:51:03] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:58:42] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:38:42] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [12:39:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:45:42] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [12:46:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 
is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:16:35] (PS2) Gergő Tisza: Remove edituserjs from existing groups [mediawiki-config] - https://gerrit.wikimedia.org/r/421124 (https://phabricator.wikimedia.org/T190015) [13:16:37] (PS3) Gergő Tisza: Enforce that jseditor is the only group that can edit sitewide JS [mediawiki-config] - https://gerrit.wikimedia.org/r/421125 (https://phabricator.wikimedia.org/T190015) [14:19:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.090 second response time [14:34:02] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2034 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [14:59:43] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.092 second response time [15:49:00] (PS1) ArielGlenn: fix up check for createdirs job with cutoff date [dumps] - https://gerrit.wikimedia.org/r/428150 [15:50:05] (CR) ArielGlenn: [C: 2] fix up check for createdirs job with cutoff date [dumps] - https://gerrit.wikimedia.org/r/428150 (owner: ArielGlenn) [15:58:02] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.065 second response time [15:58:02] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.057 second response time [15:58:02] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 75511 bytes in 0.351 second response time [16:23:24] (PS1) ArielGlenn: rerun only the specific stub type(s) that are missing [dumps] - 
https://gerrit.wikimedia.org/r/428151 [16:25:12] PROBLEM - Long running screen/tmux on elastic1020 is CRITICAL: CRIT: Long running tmux process. (user: ebernhardson PID: 19839, 1730633s 1728000s). [16:28:34] (CR) ArielGlenn: [C: 2] rerun only the specific stub type(s) that are missing [dumps] - https://gerrit.wikimedia.org/r/428151 (owner: ArielGlenn) [16:29:37] !log ariel@tin Started deploy [dumps/dumps@bb7ae96]: creadtedirs date fixup, rerun only missing stub type [16:29:40] !log ariel@tin Finished deploy [dumps/dumps@bb7ae96]: creadtedirs date fixup, rerun only missing stub type (duration: 00m 03s) [16:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:05] (PS1) ArielGlenn: disable cron for partial dumps temporarily [puppet] - https://gerrit.wikimedia.org/r/428152 [16:48:02] (CR) ArielGlenn: [C: 2] disable cron for partial dumps temporarily [puppet] - https://gerrit.wikimedia.org/r/428152 (owner: ArielGlenn) [17:28:54] (PS1) ArielGlenn: add lbzip2 to the snapshots [puppet] - https://gerrit.wikimedia.org/r/428153 (https://phabricator.wikimedia.org/T179059) [17:31:21] (CR) ArielGlenn: [C: 2] add lbzip2 to the snapshots [puppet] - https://gerrit.wikimedia.org/r/428153 (https://phabricator.wikimedia.org/T179059) (owner: ArielGlenn) [18:31:33] (PS1) ArielGlenn: enable lbzip2 use for all decompression parts of recompression jobs [dumps] - https://gerrit.wikimedia.org/r/428156 (https://phabricator.wikimedia.org/T179059) [18:43:18] (PS2) ArielGlenn: enable lbzip2 use for all decompression parts of recompression jobs [dumps] - https://gerrit.wikimedia.org/r/428156 (https://phabricator.wikimedia.org/T179059) [20:27:12] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0 [20:30:13] RECOVERY - Router 
interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [22:11:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.081 second response time