[00:02:02] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/12235/" [puppet] - https://gerrit.wikimedia.org/r/455273 (owner: Dzahn)
[00:04:49] (CR) Dzahn: [C: 1] swift: Fix checks on drive/filesystem titles to allow for labs ones [puppet] - https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236) (owner: Alex Monk)
[00:07:38] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/12237/" [puppet] - https://gerrit.wikimedia.org/r/439791 (https://phabricator.wikimedia.org/T87338) (owner: Alex Monk)
[00:09:05] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (ircecho)$ git review
[00:09:06] Problem running 'git remote update gerrit'
[00:09:06] Fetching gerrit
[00:09:06] fatal: internal server error
[00:09:06] fatal: protocol error: unexpected '4da0a5724ad57811518f33db030974bd683a0f73001eERR internal server er'
[00:09:07] error: Could not fetch gerrit
[00:09:09] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (ircecho)$
[00:10:43] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/12238/bohrium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/453553 (owner: Dzahn)
[00:17:49] (CR) Dzahn: "instance using this: https://tools.wmflabs.org/openstack-browser/server/relic.toolserver-legacy.eqiad.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/448811 (owner: Dzahn)
[00:20:40] Krenair: i have seen it once or twice but rarely..like maybe every couple weeks,but not today
[00:21:15] and was uploading quite a bit just now
[01:31:17] yay
[01:33:56] Testing shinken-wm auth
[01:34:05] legoktm, ^
[01:40:10] Testing 123
[01:54:20] (PS1) Alex Monk: ircecho: Add support for authenticating with SASL [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254)
[01:54:57] (CR) jerkins-bot: [V: -1] ircecho: Add support for authenticating with SASL [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[02:00:45] (PS2) Alex Monk: ircecho: Add support for authenticating with SASL [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254)
[02:01:28] (CR) jerkins-bot: [V: -1] ircecho: Add support for authenticating with SASL [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[02:02:43] (PS3) Alex Monk: ircecho: Add support for authenticating with SASL [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254)
[02:13:01] (PS4) Alex Monk: ircecho: Add support for authenticating with SASL [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254)
[03:03:35] (CR) Legoktm: ircecho: Add support for authenticating with SASL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[03:04:40] (CR) Alex Monk: ircecho: Add support for authenticating with SASL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[03:05:47] PROBLEM - Filesystem available is greater than filesystem size on ms-be2042 is CRITICAL: cluster=swift device=/dev/sdd1 fstype=xfs instance=ms-be2042:9100 job=node mountpoint=/srv/swift-storage/sdd1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops
[04:26:48] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:57:08] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:28:47] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/confd-lint-wrap]
[06:29:02] Operations, ops-codfw, DBA: d2058: Disk #11 predictive failure - https://phabricator.wikimedia.org/T202798 (Marostegui)
[06:29:22] Operations, ops-codfw, DBA: d2058: Disk #11 predictive failure - https://phabricator.wikimedia.org/T202798 (Marostegui) p:Triage>Normal
[06:29:47] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2058 is CRITICAL: cluster=mysql device=cciss,10 instance=db2058:9100 job=node site=codfw Marostegui T202798 - The acknowledgement expires at: 2018-08-28 06:29:34. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2058&var-datasource=codfw%2520prometheus%252Fops
[06:32:28] Operations, ops-codfw, DBA: db2058: Disk #11 predictive failure - https://phabricator.wikimedia.org/T202798 (Marostegui)
[06:36:46] Operations, monitoring: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (Volans)
[06:36:48] Operations, monitoring: add icinga1001 to allowed hosts for AQL SMS gateway - https://phabricator.wikimedia.org/T202784 (Volans) Open>Resolved a:Volans @Dzahn I've added it to the AQL whitelist. For the record the account is in pwstore.
[06:42:08] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:48:37] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:59:07] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:53:57] (CR) Paladox: ircecho: Add support for authenticating with SASL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[09:49:57] RECOVERY - Memory correctable errors -EDAC- on scb1002 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops
[11:08:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:14:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:18:20] (PS1) Zoranzoki21: Enable AbuseFilter 'block' on it.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/455307 (https://phabricator.wikimedia.org/T202808)
[11:19:33] (PS2) Zoranzoki21: Enable AbuseFilter 'block' on it.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/455307 (https://phabricator.wikimedia.org/T202808)
[11:51:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[11:55:06] (CR) Daimona Eaytoy: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/455307 (https://phabricator.wikimedia.org/T202808) (owner: Zoranzoki21)
[12:02:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[12:27:35] Operations, ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T201757 (Marostegui) a:Marostegui>Papaul Can you pull the disk out wait a couple of minutes and insert it again? It failed to rebuild
[13:46:57] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:47:57] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.036 second response time
[17:05:17] (PS1) Zoranzoki21: Fix "seperated" typo in MWMultiVersion.php file [mediawiki-config] - https://gerrit.wikimedia.org/r/455354
[17:05:38] (PS2) Zoranzoki21: Fix "seperated" typo in MWMultiVersion.php file [mediawiki-config] - https://gerrit.wikimedia.org/r/455354
[18:12:18] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:18:48] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:18:58] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:19:37] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:19:47] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:19:47] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:19:47] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:19:47] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:20:08] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:30:08] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures
[18:30:18] RECOVERY - Disk space on stat1005 is OK: DISK OK
[18:30:27] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[18:30:28] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[18:30:28] RECOVERY - DPKG on stat1005 is OK: All packages OK
[18:30:37] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[18:30:57] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[18:39:27] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:39:58] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:40:07] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:40:07] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:40:08] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:40:08] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:42:28] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[18:42:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:46:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:00:39] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[19:00:39] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[19:00:47] RECOVERY - DPKG on stat1005 is OK: All packages OK
[19:00:47] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[19:01:07] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[19:02:47] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:10:38] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[19:11:18] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[19:11:19] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[19:11:27] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[19:11:27] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[19:15:07] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[19:29:58] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[19:30:08] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures
[19:30:39] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[19:30:39] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[19:30:47] RECOVERY - DPKG on stat1005 is OK: All packages OK
[19:30:47] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[19:41:27] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[19:43:28] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[19:56:37] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:00:57] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:09:37] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:11:47] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:22:30] (PS1) Reedy: Add fixcopyright(\.m)?\.wikimedia\.org [dns] - https://gerrit.wikimedia.org/r/455368
[20:30:46] (PS1) Reedy: Add fixcopyright.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/455369
[20:33:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:34:17] PROBLEM - HHVM rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:35:07] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 74428 bytes in 0.105 second response time
[20:35:17] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:39:38] (PS2) Reedy: Add fixcopyright(\.m)?\.wikimedia\.org [dns] - https://gerrit.wikimedia.org/r/455368 (https://phabricator.wikimedia.org/T202819)
[20:39:55] (PS2) Reedy: Add fixcopyright.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819)
[22:07:27] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 52562 MB (10% inode=99%)
[22:28:58] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[22:30:07] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[22:39:28] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:28:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:39:28] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen