[00:14:10] <icinga-wm>	 RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:13:20] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[01:14:10] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3369605 keys, up 55 days 16 hours - replication_delay is 0
[01:35:10] <icinga-wm>	 PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:51:10] <icinga-wm>	 RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[02:05:10] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[02:12:50] <icinga-wm>	 PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:32:10] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[02:41:50] <icinga-wm>	 RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:49:50] <icinga-wm>	 PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:17:50] <icinga-wm>	 RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[03:23:50] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 722.96 seconds
[03:28:50] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 273.15 seconds
[04:08:30] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2076.90 Read Requests/Sec=3685.30 Write Requests/Sec=490.50 KBytes Read/Sec=15337.60 KBytes_Written/Sec=6639.60
[04:12:25] <wikibugs>	 Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2901922 (Revent) @yann I have the impression that more action will be taken after the holidays.
[04:16:30] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.40 Read Requests/Sec=215.50 Write Requests/Sec=6.50 KBytes Read/Sec=2157.20 KBytes_Written/Sec=417.60
[05:07:50] <icinga-wm>	 PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:22:00] <icinga-wm>	 PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%)
[05:35:50] <icinga-wm>	 RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[05:51:00] <icinga-wm>	 PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:01:20] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:19:10] <icinga-wm>	 PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:20:00] <icinga-wm>	 RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:29:20] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:36:00] <icinga-wm>	 RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[07:17:30] <icinga-wm>	 RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:22:10] <icinga-wm>	 RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[07:24:10] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:52:10] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[08:49:10] <icinga-wm>	 PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:10] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 2 minutes ago with 11 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[09:17:10] <icinga-wm>	 RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[09:33:30] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[10:26:00] <icinga-wm>	 PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:35:19] <tabbycat>	 Reedy: https://meta.wikimedia.org/wiki/Help_talk:Two-factor_authentication#No_scratch_codes_available
[10:51:20] <icinga-wm>	 PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:54:00] <icinga-wm>	 RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[10:58:50] <wikibugs>	 Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2902043 (elukey) >>! In T153488#2901590, @Yann wrote: > So it seems this bug is still quite serious, isn't?...
[11:19:20] <icinga-wm>	 RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[11:58:57] <wikibugs>	 (PS1) Tim Landscheidt: Revert "tools: store verbose logrotate logs" [puppet] - https://gerrit.wikimedia.org/r/329217 (https://phabricator.wikimedia.org/T96007)
[11:59:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "tools: store verbose logrotate logs" [puppet] - https://gerrit.wikimedia.org/r/329217 (https://phabricator.wikimedia.org/T96007) (owner: Tim Landscheidt)
[12:03:15] <wikibugs>	 (PS2) Tim Landscheidt: Revert "tools: store verbose logrotate logs" [puppet] - https://gerrit.wikimedia.org/r/329217 (https://phabricator.wikimedia.org/T96007)
[13:01:20] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[13:29:20] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[14:39:20] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[14:48:52] <wikibugs>	 Operations, Analytics-Kanban, User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2902328 (Ottomata) Nice find!  Let's keep an eye on this and hope that they release something with Spark 2.0 soon so we can do an upgrade.
[14:50:40] <wikibugs>	 Operations, Patch-For-Review: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2902329 (Ottomata) Open>Resolved a:Ottomata Ah thanks!
[15:03:30] <icinga-wm>	 PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:07:30] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:10:29] <wikibugs>	 Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2902353 (Ottomata) Great!  Yeah, if your number of dependencies is small enough, it is ea...
[15:32:30] <icinga-wm>	 RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[15:38:10] <icinga-wm>	 PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:06:10] <icinga-wm>	 RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[16:20:00] <icinga-wm>	 PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:46:30] <icinga-wm>	 PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:49:00] <icinga-wm>	 RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[17:13:45] <wikibugs>	 Puppet, Labs: role::puppetmaster::standalone has no firewall rule for port 8140 - https://phabricator.wikimedia.org/T154150#2902431 (scfc)
[17:14:30] <icinga-wm>	 RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[17:36:51] <wikibugs>	 (CR) Ottomata: [C: 1] eventbus: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/328665 (owner: Muehlenhoff)
[17:37:41] <wikibugs>	 (PS2) Ottomata: Add libgomp1 to hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/327173 (owner: EBernhardson)
[17:39:30] <wikibugs>	 (CR) Ottomata: [C: 2] Add libgomp1 to hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/327173 (owner: EBernhardson)
[17:40:29] <wikibugs>	 (CR) Ottomata: [C: 1] "Ha, +1, but surely I don't have much more context than you :p" [dns] - https://gerrit.wikimedia.org/r/326913 (owner: Jcrespo)
[17:58:39] <wikibugs>	 (Abandoned) Ottomata: [WIP] Mirror main-eqiad into main-codfw [puppet] - https://gerrit.wikimedia.org/r/304928 (https://phabricator.wikimedia.org/T134184) (owner: Ottomata)
[18:00:20] <icinga-wm>	 PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:18:16] <wikibugs>	 (PS1) Tim Landscheidt: puppetmaster: Enable expand_path for Hiera in Labs as well [puppet] - https://gerrit.wikimedia.org/r/329226
[18:29:20] <icinga-wm>	 RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[18:29:40] <icinga-wm>	 PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:31:16] <wikibugs>	 (CR) Tim Landscheidt: "@Joe: You previously removed expand_path from Labs with 69e55590c178c585fafe7e691db6da25e93ee248; if you think there is a better way, plea" [puppet] - https://gerrit.wikimedia.org/r/329226 (owner: Tim Landscheidt)
[18:59:40] <icinga-wm>	 RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[19:05:28] <wikibugs>	 (PS2) Ottomata: Add rdkafka_config deployment var to eventstreams service module and role [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925)
[19:06:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add rdkafka_config deployment var to eventstreams service module and role [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[19:06:51] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@e771863]: (no message)
[19:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:02] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@e771863]: (no message) (duration: 00m 10s)
[19:07:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:09] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@e771863]: (no message)
[19:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:14] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@e771863]: (no message) (duration: 00m 04s)
[19:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:24] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@e771863]: log
[19:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:28] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@e771863]: log (duration: 00m 03s)
[19:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:34] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@e771863]: (no message)
[19:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:39] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@e771863]: (no message) (duration: 01m 05s)
[19:08:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:30] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@836b441]: (no message)
[19:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:47] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@836b441]: (no message) (duration: 01m 16s)
[19:10:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:01] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@581a5a1]: (no message)
[19:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:24] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@581a5a1]: (no message) (duration: 00m 22s)
[19:13:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:47] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@90934c3]: (no message)
[19:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:19] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@90934c3]: (no message) (duration: 00m 31s)
[19:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:58] <wikibugs>	 (PS3) Ottomata: Add rdkafka_config deployment var to eventstreams service module and role [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925)
[19:20:52] <wikibugs>	 (CR) Ottomata: [C: 2] Add rdkafka_config deployment var to eventstreams service module and role [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[19:20:56] <wikibugs>	 (CR) Ottomata: [C: 2] "https://puppet-compiler.wmflabs.org/4993/scb1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[19:25:51] <wikibugs>	 (PS1) Ottomata: Send EventStreams rdkafka config to statsd every minute [puppet] - https://gerrit.wikimedia.org/r/329233
[19:28:50] <wikibugs>	 (CR) Ottomata: [C: 2] Send EventStreams rdkafka config to statsd every minute [puppet] - https://gerrit.wikimedia.org/r/329233 (owner: Ottomata)
[19:30:33] <wikibugs>	 (CR) Ottomata: [C: 2] "https://puppet-compiler.wmflabs.org/4994/scb1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/329233 (owner: Ottomata)
[19:31:49] <wikibugs>	 Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2902626 (Revent) @elukey Just to be clear, I have not (other than possibly incidentally) been putting old fai...
[19:32:28] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@90934c3]: (no message)
[19:32:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:00] <icinga-wm>	 PROBLEM - eventstreams on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 8092: Connection refused
[19:35:08] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@ed2e39c]: (no message)
[19:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:20] <icinga-wm>	 PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:36:26] <ottomata>	 me ^  all is ok
[19:36:43] <ottomata>	 something weird with jinja scap config template...
[19:38:18] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@648613a]: (no message)
[19:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:52] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@648613a]: (no message) (duration: 00m 34s)
[19:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:00] <icinga-wm>	 RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.106 second response time
[19:39:20] <icinga-wm>	 RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational
[20:04:40] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:05:11] <yurik>	 ottomata, do you have etherpad server access?
[20:05:22] <yurik>	 or do you even know how to deal with it? :)
[20:06:36] <ottomata>	 yurik:  i probably have server access, but i know nothing :)
[20:06:46] <yurik>	 same here :(
[20:08:10] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:10:00] <MatmaRex>	 yurik: https://wikitech.wikimedia.org/wiki/Etherpad
[20:13:30] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:13:40] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:24:22] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@648613a]: (no message)
[20:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:30] <icinga-wm>	 PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:24:37] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@648613a]: (no message) (duration: 00m 15s)
[20:24:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:15] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:27:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:26] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 12s)
[20:27:28] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:27:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:10] <icinga-wm>	 PROBLEM - eventstreams on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 8092: Connection refused
[20:30:13] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 02m 45s)
[20:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:20] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:30:20] <icinga-wm>	 PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:26] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:30:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:07] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 40s)
[20:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:11] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:31:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:16] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 04s)
[20:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:01] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:32:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:05] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 04s)
[20:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:19] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:30] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:34:08] <ottomata>	 sca?
[20:34:15] <ottomata>	 scb2001 is me
[20:34:19] <ottomata>	 more scap problems?
[20:34:41] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 01m 22s)
[20:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:01] <apergos>	 the catalog fetch fail is probably bogus. or do you mean something else for sca?
[20:35:39] <ottomata>	 no that
[20:35:48] <ottomata>	 thanks
[20:36:10] <apergos>	 we see those from time to time for various hosts, they always recover by the next run
[20:36:42] <apergos>	 are we still in a deployment freeze?  I thought... or...?
[20:37:27] <ottomata>	 my stuff is not live apergos
[20:37:28] <ottomata>	 not prod at all
[20:37:32] <ottomata>	 no public access
[20:37:46] <apergos>	 ah ha
[20:38:03] <apergos>	 oh codfw
[20:38:07] <apergos>	 good
[20:38:30] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:38:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:52] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 23s)
[20:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:10] <icinga-wm>	 RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.105 second response time
[20:39:20] <icinga-wm>	 RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational
[20:39:30] <icinga-wm>	 PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[eventstreams/deploy]
[20:40:30] <icinga-wm>	 PROBLEM - puppet last run on analytics1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:42:30] <icinga-wm>	 RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[20:53:30] <icinga-wm>	 RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:56:20] <icinga-wm>	 PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:02:30] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[21:02:34] <wikibugs>	 (PS1) Ottomata: stat1001 is now a spare and can be reclaimed [puppet] - https://gerrit.wikimedia.org/r/329243 (https://phabricator.wikimedia.org/T149438)
[21:03:43] <apergos>	 w00t!
[21:04:04] <wikibugs>	 (CR) Ottomata: [C: 2] stat1001 is now a spare and can be reclaimed [puppet] - https://gerrit.wikimedia.org/r/329243 (https://phabricator.wikimedia.org/T149438) (owner: Ottomata)
[21:07:11] <ottomata>	 apergos: q:  i can't remember how to clean stored configs out for icinga
[21:07:12] <ottomata>	 !
[21:07:18] <ottomata>	 and wikitech search is not helping
[21:07:19] <ottomata>	 do you remember?
[21:07:21] <ottomata>	 something like
[21:07:29] <ottomata>	 sudo puppetcleanstoredconfigs.rb 
[21:07:31] <ottomata>	 or some
[21:07:32] <ottomata>	 thing
[21:08:34] <apergos>	 hold up, you want to look at the server lifecycle page, I'm in the middle of this other issue
[21:08:41] <apergos>	 gimme 1 minute
[21:09:30] <icinga-wm>	 RECOVERY - puppet last run on analytics1046 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[21:11:58] <ottomata>	 apergos:  aye i'm looking there
[21:12:00] <ottomata>	 it seems to be gone
[21:12:01] <ottomata>	 k
[21:13:05] <apergos>	  $ puppet node clean <fqdn>          $ puppet node deactivate <fqdn> 
[21:13:17] <apergos>	 this is for stat1001 cleanup right?
[21:13:56] <apergos>	 it's in the "Steps for DC-OPS" section, make sure you do the other pupept related stuff beforehand
[21:13:58] <apergos>	 *puppet
[21:14:03] <ottomata>	 ahhh
[21:14:06] <ottomata>	 thanks, yeah
[21:14:09] <ottomata>	 ok, weird
[21:14:13] <ottomata>	 i was looking for the old command
[21:14:16] <ottomata>	 didn't realize it had changed
[21:14:23] <apergos>	 no good, it used to be more finicky steps, now it's better!
[21:14:34] <apergos>	 *no, good 
[21:14:36] <ottomata>	 those seem to be in the wrong order though..., i don't want to turn services offline before I clean out icinga
[21:14:42] <apergos>	 commas gonna get me
[21:15:02] <apergos>	 you want to depool it (if there is pooling)
[21:15:16] <apergos>	 then take it out of puppet manifest
[21:15:21] <apergos>	 then tell puppet it's gone
[21:15:40] <ottomata>	 hmmm ok, but it also says "services offline"
[21:15:45] <ottomata>	 before node clean
[21:15:52] <ottomata>	 i guess disabling icinga checks is fine...
[21:15:53] <ottomata>	 fine!
[21:15:55] <ottomata>	 i'll do it that way
[21:15:58] <ottomata>	 :)
[21:16:28] <ottomata>	 !log disabled active checks of stat1001 services T149438
[21:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:32] <stashbot>	 T149438: Replace stat1001  - https://phabricator.wikimedia.org/T149438
[21:16:38] <apergos>	  Confirm all puppet manifest entries removal, DSH removal, Hiera data removal.
[21:16:39] <apergos>	 that
[21:16:58] <apergos>	 as well as it being gone from whatever pools, config files, etc
[21:17:05] <apergos>	 all that is dependent on the server and services
[21:17:13] <apergos>	 that's all I know
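The cleanup order described in the exchange above can be sketched as a dry-run helper. This is only an illustration under stated assumptions: the function name `decom_checklist` and the placeholder hostname are hypothetical, the two `puppet node` commands are the ones quoted in the channel, and nothing here is executed against a real host.

```python
def decom_checklist(fqdn: str) -> list[str]:
    """Return the cleanup steps from the discussion, in the order given.

    Hypothetical helper: it only builds strings and runs nothing.
    """
    if not fqdn or any(ch.isspace() for ch in fqdn):
        raise ValueError("fqdn must be a non-empty hostname without whitespace")
    return [
        # Manual steps mentioned in the channel, kept as plain reminders.
        "depool the host (if there is pooling)",
        "take the host out of the puppet manifest",
        # Commands quoted in the channel for telling puppet the node is gone.
        f"puppet node clean {fqdn}",
        f"puppet node deactivate {fqdn}",
    ]


if __name__ == "__main__":
    # host.example.org is a placeholder, not a real server name.
    for step in decom_checklist("host.example.org"):
        print(step)
```

Printing the list before running anything by hand makes it easier to check the order against the server lifecycle page, which remains the authoritative reference.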
[21:17:41] <ottomata>	 danke
[21:17:44] <wikibugs>	 (PS1) Ottomata: Remove stat1001 from site.pp [puppet] - https://gerrit.wikimedia.org/r/329247 (https://phabricator.wikimedia.org/T149438)
[21:19:13] <wikibugs>	 (CR) Ottomata: [C: 2] Remove stat1001 from site.pp [puppet] - https://gerrit.wikimedia.org/r/329247 (https://phabricator.wikimedia.org/T149438) (owner: Ottomata)
[21:20:16] <wikibugs>	 (PS1) Ottomata: Revert "Remove stat1001 from site.pp" [puppet] - https://gerrit.wikimedia.org/r/329248
[21:20:26] <ottomata>	 apergos:  ah, instructions say to leave it in site.pp with role spare
[21:20:52] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] Revert "Remove stat1001 from site.pp" [puppet] - https://gerrit.wikimedia.org/r/329248 (owner: Ottomata)
[21:22:45] <wikibugs>	 Operations, hardware-requests: Reclaim/Decommission (specify) stat1001 - https://phabricator.wikimedia.org/T154164#2902803 (Ottomata)
[21:24:20] <icinga-wm>	 RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[21:31:34] <icinga-wm>	 PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:39:34] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 2 minutes ago with 10 failures. Failed resources (up to 3 shown): Service[salt-minion],Service[ssh],Service[nagios-nrpe-server],Package[tzdata]
[21:44:47] <apergos>	 ah for reclaim as spare
[21:44:57] <apergos>	 right
[21:59:34] <icinga-wm>	 RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[22:07:34] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[22:21:24] <icinga-wm>	 PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:22:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Transit: Init7 (donated) {#14009} [10Gbps]
[22:24:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0
[22:43:05] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 23 probes of 261 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[22:49:24] <icinga-wm>	 RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[22:58:04] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 8 probes of 261 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[23:58:54] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:59:44] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy