[00:14:10] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:13:20] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[01:14:10] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3369605 keys, up 55 days 16 hours - replication_delay is 0
[01:35:10] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:51:10] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[02:05:10] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[02:12:50] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:32:10] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[02:41:50] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:49:50] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:17:50] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[03:23:50] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 722.96 seconds
[03:28:50] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 273.15 seconds
[04:08:30] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2076.90 Read Requests/Sec=3685.30 Write Requests/Sec=490.50 KBytes Read/Sec=15337.60 KBytes_Written/Sec=6639.60
[04:12:25] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2901922 (Revent) @yann I have the impression that more action will be taken after the holidays.
[04:16:30] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.40 Read Requests/Sec=215.50 Write Requests/Sec=6.50 KBytes Read/Sec=2157.20 KBytes_Written/Sec=417.60
[05:07:50] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:22:00] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%)
[05:35:50] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[05:51:00] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:01:20] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:19:10] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:20:00] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:29:20] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:36:00] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[07:17:30] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:22:10] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[07:24:10] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:52:10] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[08:49:10] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:10] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 2 minutes ago with 11 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[09:17:10] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[09:33:30] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[10:26:00] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:35:19] Reedy: https://meta.wikimedia.org/wiki/Help_talk:Two-factor_authentication#No_scratch_codes_available
[10:51:20] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:54:00] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[10:58:50] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2902043 (elukey) >>! In T153488#2901590, @Yann wrote: > So it seems this bug is still quite serious, isn't?...
[11:19:20] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[11:58:57] (PS1) Tim Landscheidt: Revert "tools: store verbose logrotate logs" [puppet] - https://gerrit.wikimedia.org/r/329217 (https://phabricator.wikimedia.org/T96007)
[11:59:21] (CR) jerkins-bot: [V: -1] Revert "tools: store verbose logrotate logs" [puppet] - https://gerrit.wikimedia.org/r/329217 (https://phabricator.wikimedia.org/T96007) (owner: Tim Landscheidt)
[12:03:15] (PS2) Tim Landscheidt: Revert "tools: store verbose logrotate logs" [puppet] - https://gerrit.wikimedia.org/r/329217 (https://phabricator.wikimedia.org/T96007)
[13:01:20] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[13:29:20] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[14:39:20] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[14:48:52] Operations, Analytics-Kanban, User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2902328 (Ottomata) Nice find! Let's keep an eye on this and hope that they release something with Spark 2.0 soon so we can do an upgrade.
[14:50:40] Operations, Patch-For-Review: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2902329 (Ottomata) Open>Resolved a:Ottomata Ah thanks!
[15:03:30] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:07:30] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:10:29] Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2902353 (Ottomata) Great! Yeah, if your number of dependencies is small enough, it is ea...
[15:32:30] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[15:38:10] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:06:10] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[16:20:00] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:46:30] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:49:00] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[17:13:45] Puppet, Labs: role::puppetmaster::standalone has no firewall rule for port 8140 - https://phabricator.wikimedia.org/T154150#2902431 (scfc)
[17:14:30] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[17:36:51] (CR) Ottomata: [C: 1] eventbus: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/328665 (owner: Muehlenhoff)
[17:37:41] (PS2) Ottomata: Add libgomp1 to hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/327173 (owner: EBernhardson)
[17:39:30] (CR) Ottomata: [C: 2] Add libgomp1 to hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/327173 (owner: EBernhardson)
[17:40:29] (CR) Ottomata: [C: 1] "Ha, +1, but surely I don't have much more context than you :p" [dns] - https://gerrit.wikimedia.org/r/326913 (owner: Jcrespo)
[17:58:39] (Abandoned) Ottomata: [WIP] Mirror main-eqiad into main-codfw [puppet] - https://gerrit.wikimedia.org/r/304928 (https://phabricator.wikimedia.org/T134184) (owner: Ottomata)
[18:00:20] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:18:16] (PS1) Tim Landscheidt: puppetmaster: Enable expand_path for Hiera in Labs as well [puppet] - https://gerrit.wikimedia.org/r/329226
[18:29:20] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[18:29:40] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:31:16] (CR) Tim Landscheidt: "@Joe: You previously removed expand_path from Labs with 69e55590c178c585fafe7e691db6da25e93ee248; if you think there is a better way, plea" [puppet] - https://gerrit.wikimedia.org/r/329226 (owner: Tim Landscheidt)
[18:59:40] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[19:05:28] (PS2) Ottomata: Add rdkafka_config deployment var to eventstreams service module and role [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925)
[19:06:15] (CR) jerkins-bot: [V: -1] Add rdkafka_config deployment var to eventstreams service module and role [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[19:06:51] !log otto@tin Starting deploy [eventstreams/deploy@e771863]: (no message)
[19:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:02] !log otto@tin Finished deploy [eventstreams/deploy@e771863]: (no message) (duration: 00m 10s)
[19:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:09] !log otto@tin Starting deploy [eventstreams/deploy@e771863]: (no message)
[19:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:14] !log otto@tin Finished deploy [eventstreams/deploy@e771863]: (no message) (duration: 00m 04s)
[19:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:24] !log otto@tin Starting deploy [eventstreams/deploy@e771863]: log
[19:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:28] !log otto@tin Finished deploy [eventstreams/deploy@e771863]: log (duration: 00m 03s)
[19:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:34] !log otto@tin Starting deploy [eventstreams/deploy@e771863]: (no message)
[19:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:39] !log otto@tin Finished deploy [eventstreams/deploy@e771863]: (no message) (duration: 01m 05s)
[19:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:30] !log otto@tin Starting deploy [eventstreams/deploy@836b441]: (no message)
[19:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:47] !log otto@tin Finished deploy [eventstreams/deploy@836b441]: (no message) (duration: 01m 16s)
[19:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:01] !log otto@tin Starting deploy [eventstreams/deploy@581a5a1]: (no message)
[19:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:24] !log otto@tin Finished deploy [eventstreams/deploy@581a5a1]: (no message) (duration: 00m 22s)
[19:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:47] !log otto@tin Starting deploy [eventstreams/deploy@90934c3]: (no message)
[19:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:19] !log otto@tin Finished deploy [eventstreams/deploy@90934c3]: (no message) (duration: 00m 31s)
[19:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:58] (PS3) Ottomata: Add rdkafka_config deployment var to eventstreams service module and role [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925)
[19:20:52] (CR) Ottomata: [C: 2] Add rdkafka_config deployment var to eventstreams service module and role [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[19:20:56] (CR) Ottomata: [C: 2] "https://puppet-compiler.wmflabs.org/4993/scb1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[19:25:51] (PS1) Ottomata: Send EventStreams rdkafka config to statsd every minute [puppet] - https://gerrit.wikimedia.org/r/329233
[19:28:50] (CR) Ottomata: [C: 2] Send EventStreams rdkafka config to statsd every minute [puppet] - https://gerrit.wikimedia.org/r/329233 (owner: Ottomata)
[19:30:33] (CR) Ottomata: [C: 2] "https://puppet-compiler.wmflabs.org/4994/scb1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/329233 (owner: Ottomata)
[19:31:49] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2902626 (Revent) @elukey Just to be clear, I have not (other than possibly incidentally) been putting old fai...
[19:32:28] !log otto@tin Starting deploy [eventstreams/deploy@90934c3]: (no message)
[19:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:00] PROBLEM - eventstreams on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 8092: Connection refused
[19:35:08] !log otto@tin Starting deploy [eventstreams/deploy@ed2e39c]: (no message)
[19:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:20] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:36:26] me ^ all is ok
[19:36:43] something weird with jinja scap config template...
[19:38:18] !log otto@tin Starting deploy [eventstreams/deploy@648613a]: (no message)
[19:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:52] !log otto@tin Finished deploy [eventstreams/deploy@648613a]: (no message) (duration: 00m 34s)
[19:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:00] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.106 second response time
[19:39:20] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational
[20:04:40] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:05:11] ottomata, do you have etherpad server access?
[20:05:22] or do you even know how to deal with it? :)
[20:06:36] yurik: i probably have server access, but i know nothing :)
[20:06:46] same here :(
[20:08:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:10:00] yurik: https://wikitech.wikimedia.org/wiki/Etherpad
[20:13:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:13:40] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:24:22] !log otto@tin Starting deploy [eventstreams/deploy@648613a]: (no message)
[20:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:30] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:24:37] !log otto@tin Finished deploy [eventstreams/deploy@648613a]: (no message) (duration: 00m 15s)
[20:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:15] !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:26] !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 12s)
[20:27:28] !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:10] PROBLEM - eventstreams on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 8092: Connection refused
[20:30:13] !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 02m 45s)
[20:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:20] !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:30:20] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:26] !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:07] !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 40s)
[20:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:11] !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:16] !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 04s)
[20:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:01] !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:05] !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 04s)
[20:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:19] !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:30] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:34:08] sca?
[20:34:15] scb2001 is me
[20:34:19] more scap prolems?
[20:34:41] !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 01m 22s)
[20:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:01] the catalog fetch fail is probably bogus. or do you mean something else for sca?
[20:35:39] no that
[20:35:48] thanks
[20:36:10] we see those from time to time for various hosts, they always recover by the next run
[20:36:42] are we still in a deployment freeze? I thought... or...?
[20:37:27] my stuff is not live apergos
[20:37:28] not prod at all
[20:37:32] no public access
[20:37:46] ah ha
[20:38:03] oh codfw
[20:38:07] good
[20:38:30] !log otto@tin Starting deploy [eventstreams/deploy@590ea96]: (no message)
[20:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:52] !log otto@tin Finished deploy [eventstreams/deploy@590ea96]: (no message) (duration: 00m 23s)
[20:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:10] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.105 second response time
[20:39:20] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational
[20:39:30] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[eventstreams/deploy]
[20:40:30] PROBLEM - puppet last run on analytics1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:42:30] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[20:53:30] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:56:20] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:02:30] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[21:02:34] (PS1) Ottomata: stat1001 is now a spare and can be reclaimed [puppet] - https://gerrit.wikimedia.org/r/329243 (https://phabricator.wikimedia.org/T149438)
[21:03:43] w00t!
[21:04:04] (CR) Ottomata: [C: 2] stat1001 is now a spare and can be reclaimed [puppet] - https://gerrit.wikimedia.org/r/329243 (https://phabricator.wikimedia.org/T149438) (owner: Ottomata)
[21:07:11] apergos: q: i can't remember how to clean stored configs out for icinga
[21:07:12] !
[21:07:18] and wikitech search is not helping
[21:07:19] do you remember?
[21:07:21] something like
[21:07:29] sudo puppetcleanstoredconfigs.rb
[21:07:31] or some
[21:07:32] thing
[21:08:34] hold up, you want to look at the server lifecycle page, I'm in the middle of this other issue
[21:08:41] gimme 1 minute
[21:09:30] RECOVERY - puppet last run on analytics1046 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[21:11:58] apergos: aye i'm looking there
[21:12:00] it seems to be gone
[21:12:01] k
[21:13:05] $ puppet node clean $ puppet node deactivate
[21:13:17] this is for stat1001 cleanup right?
[21:13:56] it's in the "Steps for DC-OPS" section, make sure you do the other pupept related stuff beforehand
[21:13:58] *puppet
[21:14:03] ahhh
[21:14:06] thanks, yeah
[21:14:09] ok, weird
[21:14:13] i was looking for the old command
[21:14:16] didn't realize it had changed
[21:14:23] no good, it used to be more finicky steps, now it's better!
[21:14:34] *no, good
[21:14:36] those seem to be in the wrong order though..., i don't want to turn servies offline before I clean out icinga
[21:14:42] commas gonna get me
[21:15:02] you want to depool it (if there is pooling)
[21:15:16] then take it out of puppet manifest
[21:15:21] then tell puppet it's gone
[21:15:40] hmmm ok, but it also says "services offline"
[21:15:45] before node clean
[21:15:52] i guess disabling icinga checks is fine...
[21:15:53] fine!
[21:15:55] i'll do it that way
[21:15:58] :)
[21:16:28] !log disabled active checks of stat1001 services T149438
[21:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:32] T149438: Replace stat1001 - https://phabricator.wikimedia.org/T149438
[21:16:38] Confirm all puppet manifest entires removal, DSH removal, Hiera data removal.
[21:16:39] that
[21:16:58] as well as it being gone from whatever pools, config files, etc
[21:17:05] all that is dependent on the server and services
[21:17:13] that's all I know
[21:17:41] danke
[21:17:44] (PS1) Ottomata: Remove stat1001 from site.pp [puppet] - https://gerrit.wikimedia.org/r/329247 (https://phabricator.wikimedia.org/T149438)
[21:19:13] (CR) Ottomata: [C: 2] Remove stat1001 from site.pp [puppet] - https://gerrit.wikimedia.org/r/329247 (https://phabricator.wikimedia.org/T149438) (owner: Ottomata)
[21:20:16] (PS1) Ottomata: Revert "Remove stat1001 from site.pp" [puppet] - https://gerrit.wikimedia.org/r/329248
[21:20:26] apergos: ah, instructinos say to leave it in site.pp with role spare
[21:20:52] (CR) Ottomata: [V: 2 C: 2] Revert "Remove stat1001 from site.pp" [puppet] - https://gerrit.wikimedia.org/r/329248 (owner: Ottomata)
[21:22:45] Operations, hardware-requests: Reclaim/Decommission (specify) stat1001 - https://phabricator.wikimedia.org/T154164#2902803 (Ottomata)
[21:24:20] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[21:31:34] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:39:34] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 2 minutes ago with 10 failures. Failed resources (up to 3 shown): Service[salt-minion],Service[ssh],Service[nagios-nrpe-server],Package[tzdata]
[21:44:47] ah for reclaim as spare
[21:44:57] right
[21:59:34] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[22:07:34] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[22:21:24] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:22:54] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Transit: Init7 (donated) {#14009} [10Gbps]
[22:24:54] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0
[22:43:05] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 23 probes of 261 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[22:49:24] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[22:58:04] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 8 probes of 261 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[23:58:54] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:59:44] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
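
Note on the 21:13 to 21:22 exchange: the cleanup order agreed there (depool, disable Icinga checks, sort out the puppet manifests, then `puppet node clean` and `puppet node deactivate`) can be sketched as a dry-run helper. This is only an illustration under assumptions: the function name `decommission_plan` and the FQDN `stat1001.eqiad.wmnet` are hypothetical (the log only says "stat1001"), and the script prints the commands instead of running them.

```python
from typing import List


def decommission_plan(fqdn: str) -> List[str]:
    """Return the cleanup steps for one host, in the order discussed.

    Dry-run only: nothing is executed. The manual prerequisites are
    emitted as comment lines so they are not skipped.
    """
    if not fqdn or " " in fqdn:
        raise ValueError("expected a single hostname")
    return [
        "# 1. depool the host, if it is pooled anywhere",
        "# 2. disable active Icinga checks for its services",
        "# 3. update puppet manifests (remove, or role spare when reclaiming)",
        f"puppet node clean {fqdn}",
        f"puppet node deactivate {fqdn}",
    ]


if __name__ == "__main__":
    # Hypothetical FQDN, used purely for illustration.
    for step in decommission_plan("stat1001.eqiad.wmnet"):
        print(step)
```

Keeping the two `puppet node` commands last mirrors the "then tell puppet it's gone" advice; whether a given host needs the depool step depends on the server and its services, as noted at 21:17:05.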