[00:56:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:01:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:28:28] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:29:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83500.467367 Seconds
[01:30:18] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83544.138244 Seconds
[01:30:18] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83545.24789 Seconds
[01:30:28] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83551.295048 Seconds
[01:33:18] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:33:58] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83746.675654 Seconds
[01:33:59] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83746.703798 Seconds
[01:38:18] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 84024.410796 Seconds
[01:53:08] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:53:18] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 6.235622 Seconds
[01:53:18] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 7.167741 Seconds
[01:53:28] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 13.144717 Seconds
[01:53:58] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 51.681208 Seconds
[01:53:59] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 57.262289 Seconds
[01:53:59] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 57.290283 Seconds
[01:56:28] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[02:21:08] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[02:22:18] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:28:11] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 07m 56s)
[02:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:18] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:51:18] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[02:53:54] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 07m 36s)
[02:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:29] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Apr 9 02:59:29 UTC 2017 (duration 5m 35s)
[02:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:18] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
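The "Rep Delay" figures in the Postgres Replication Lag alerts above are seconds of replication delay reported by an Icinga check run against each maps replica. A minimal sketch of how such a number can be obtained on a PostgreSQL standby is below; the connection details and thresholds are assumptions for illustration, not the actual plugin configuration used here.

```python
# Sketch: report replication delay on a PostgreSQL standby (assumed approach,
# not the actual Icinga plugin). Requires psycopg2 and local access to the replica.
import sys
import psycopg2

WARN, CRIT = 1800, 3600  # thresholds in seconds (assumed values)

conn = psycopg2.connect(host="localhost", dbname="postgres")  # hypothetical DSN
cur = conn.cursor()
# Seconds since the last WAL record was replayed on this standby.
cur.execute("SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))")
delay = float(cur.fetchone()[0] or 0.0)

if delay >= CRIT:
    print("CRITICAL - Rep Delay is: %f Seconds" % delay)
    sys.exit(2)
elif delay >= WARN:
    print("WARNING - Rep Delay is: %f Seconds" % delay)
    sys.exit(1)
print("OK - Rep Delay is: %f Seconds" % delay)
sys.exit(0)
```

Note that a delay measured this way also grows when the primary writes nothing for a while, which is one reason a busy autovacuum on the master can show up as apparent lag on every replica.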
[03:05:09] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
[03:11:08] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:15:18] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[03:16:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 3653.493023 Seconds
[03:17:08] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[03:18:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 3769.507173 Seconds
[03:20:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[03:38:08] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[03:49:28] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[03:50:58] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:53:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[03:54:28] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[03:58:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:10:48] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=652.00 Read Requests/Sec=408.40 Write Requests/Sec=0.50 KBytes Read/Sec=40392.80 KBytes_Written/Sec=17.60
[04:11:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 7004.463698 Seconds
[04:13:59] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[04:14:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 7129.555326 Seconds
[04:15:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[04:16:08] PROBLEM - puppet last run on mc1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
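The RIPE Atlas ping checks above flip to CRITICAL as soon as the number of failed probes exceeds the "alerts on" value, so 20 failures out of 280 (alerts on 19) is CRITICAL and 19 is OK again. A rough sketch of that threshold logic, assuming the per-probe results for the measurement have already been fetched into a list of loss fractions (the function and input format are illustrative, not the real plugin):

```python
# Sketch of the "failed N probes of M (alerts on K)" logic.
# Input format is an assumption: one packet-loss fraction per probe.
def atlas_ping_status(loss_per_probe, alert_on=19):
    failed = sum(1 for loss in loss_per_probe if loss >= 1.0)  # 100% loss = failed probe
    total = len(loss_per_probe)
    state = "CRITICAL" if failed > alert_on else "OK"
    return "%s - failed %d probes of %d (alerts on %d)" % (state, failed, total, alert_on)

print(atlas_ping_status([0.0] * 260 + [1.0] * 20))  # -> CRITICAL - failed 20 probes of 280 (alerts on 19)
```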
[04:18:48] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=45.00 Read Requests/Sec=0.60 Write Requests/Sec=0.40 KBytes Read/Sec=4.40 KBytes_Written/Sec=4.40
[04:19:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 7429.572563 Seconds
[04:20:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[04:20:58] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[04:21:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 7553.54276 Seconds
[04:23:08] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[04:30:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 8144.765729 Seconds
[04:31:59] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[04:37:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 8513.47083 Seconds
[04:38:08] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[04:43:08] RECOVERY - puppet last run on mc1028 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[04:55:59] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 9645.182039 Seconds
[04:56:58] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[05:09:28] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:29:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 11634.053206 Seconds
[05:31:08] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[05:37:28] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[05:39:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 12233.783922 Seconds
[05:40:08] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[05:45:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 12596.885176 Seconds
[05:46:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[05:55:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 13196.795254 Seconds
[05:57:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[06:03:08] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:05:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 13793.719078 Seconds
[06:07:09] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[06:25:08] PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
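The recurring "puppet last run" alerts fire when the agent's most recent catalog run failed or is too old, and they clear on the next clean run. A hedged sketch of that kind of check, reading the agent's last-run summary file, is below; the path, thresholds, and output wording are assumptions for illustration and differ in detail from the real check_puppetrun plugin.

```python
# Sketch: check Puppet agent health from its last-run summary.
# Path and thresholds are assumptions, not the production configuration.
import sys
import time
import yaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"  # typical agent state path
MAX_AGE = 2 * 3600  # seconds; assumed freshness budget

with open(SUMMARY) as f:
    summary = yaml.safe_load(f)

age = time.time() - summary["time"]["last_run"]
failures = summary.get("events", {}).get("failure", 0)

if failures or age > MAX_AGE:
    print("CRITICAL: %d resource failures, last run %d seconds ago" % (failures, age))
    sys.exit(2)
print("OK: last run %d seconds ago with %d failures" % (age, failures))
sys.exit(0)
```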
[06:31:08] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:40:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[06:45:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 16189.706154 Seconds
[06:45:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[06:47:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[06:54:08] RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:57:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:02:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:03:58] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time
[07:04:08] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[07:04:58] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.029 second response time
[07:05:08] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 75853 bytes in 0.102 second response time
[07:17:09] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 18109.740741 Seconds
[07:18:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[07:25:08] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:30:08] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:34:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 19136.738844 Seconds
[07:35:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[07:39:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:44:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:48:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 19973.696106 Seconds
[07:49:08] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[07:53:08] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[08:28:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 22369.582751 Seconds
[08:29:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[08:30:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 22496.811046 Seconds
[08:32:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[08:33:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 22673.702944 Seconds
[08:34:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 22729.620497 Seconds
[08:34:08] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[08:38:08] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
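The "Check HHVM threads for leakage" alert on mw1169 trips when HHVM has more than twice as many threads running or queued as Apache has busy workers. A simplified sketch of that comparison is below, assuming the HHVM thread count is read from /proc and the busy-worker count from Apache's machine-readable mod_status page; the real plugin may gather both figures differently.

```python
# Sketch: flag HHVM "thread leakage" when HHVM has more than double the
# threads that Apache has busy workers. Data sources are assumptions.
import subprocess
import urllib.request

def hhvm_threads():
    pid = subprocess.check_output(["pidof", "hhvm"]).split()[0].decode()
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return 0

def apache_busy_workers(url="http://localhost/server-status?auto"):
    body = urllib.request.urlopen(url).read().decode()
    for line in body.splitlines():
        if line.startswith("BusyWorkers:"):
            return int(line.split()[1])
    return 0

threads, busy = hhvm_threads(), apache_busy_workers()
if threads > 2 * busy:
    print("CRITICAL: HHVM has more than double threads running or queued than apache has busy workers")
else:
    print("OK: %d HHVM threads vs %d busy apache workers" % (threads, busy))
```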
[08:40:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[08:43:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 23269.616324 Seconds
[08:44:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[08:46:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 23456.713255 Seconds
[08:47:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[08:56:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 24056.657684 Seconds
[08:56:18] (03PS1) 10Urbanecm: Initial configuration for wbwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510)
[08:58:09] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 24173.798402 Seconds
[08:59:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[08:59:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[09:06:18] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:13:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 25076.873039 Seconds
[09:14:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[09:20:09] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 25492.583293 Seconds
[09:22:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[09:39:28] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:40:09] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 26692.962834 Seconds
[09:40:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 26697.071645 Seconds
[09:40:28] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[09:41:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[09:41:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[09:41:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 26766.757289 Seconds
[09:42:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[09:42:28] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[09:46:18] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[09:46:51] what's happening :P
[09:47:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 27112.889844 Seconds
[09:47:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 27126.638808 Seconds
[09:48:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[09:49:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[09:50:43] gehel: FYI all those delays ^^^ seems to match with my theory, auto-vacuum is running on the master (maps1001)
[09:52:38] volans: yeah, I have a patch that might help get better alerting. Coming up on Monday...
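The theory floated above, that the replication-lag flapping tracks an autovacuum running on the maps1001 master, can be sanity-checked by looking for autovacuum workers in pg_stat_activity on the master. A minimal sketch is below; the hostname and database name are assumptions for illustration, and this is separate from the alerting patch gehel mentions.

```python
# Sketch: list autovacuum workers on the master to corroborate the
# "replication lag follows autovacuum" theory. Connection details are hypothetical.
import psycopg2

conn = psycopg2.connect(host="maps1001.eqiad.wmnet", dbname="gis")  # assumed host/db
cur = conn.cursor()
cur.execute("""
    SELECT pid, now() - xact_start AS running_for, query
      FROM pg_stat_activity
     WHERE query LIKE 'autovacuum:%'
""")
for pid, running_for, query in cur.fetchall():
    print(pid, running_for, query)
```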
[09:54:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 27546.739515 Seconds
[09:56:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[09:59:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 27846.662603 Seconds
[10:01:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[10:07:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 28312.930035 Seconds
[10:08:09] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[10:18:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 28986.647251 Seconds
[10:20:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[10:25:09] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:36:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 30052.975401 Seconds
[10:37:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[10:39:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 30246.74998 Seconds
[10:41:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[10:42:13] (03PS1) 10DatGuy: Initial configuration for dtywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347217 (https://phabricator.wikimedia.org/T161529)
[10:49:06] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3166608 (10DatGuy) 05Open>03stalled Blocked for logo. Waiting to hear what "The Free Encyclopedia" is in Doteli.
[10:53:09] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[11:02:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 31626.657022 Seconds
[11:04:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:24:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 32933.12555 Seconds
[11:28:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[11:50:28] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[11:52:37] 06Operations, 10Icinga: Update icinga to 2.x - https://phabricator.wikimedia.org/T162542#3166631 (10Paladox)
[11:56:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 34856.899012 Seconds
[11:57:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[11:57:28] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:08:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 35586.998671 Seconds
[12:11:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[12:12:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 35813.112297 Seconds
[12:13:08] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[12:26:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:26:28] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[12:28:18] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:33:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:33:18] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:41:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 37556.765675 Seconds
[12:43:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[12:48:18] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 37976.681533 Seconds
[12:49:08] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 38033.462542 Seconds
[12:50:09] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[12:50:18] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[13:04:07] 06Operations, 10Icinga: Update icinga to 2.x - https://phabricator.wikimedia.org/T162542#3166673 (10faidon) 05Open>03declined
[13:21:46] 06Operations, 10Icinga: Update icinga to 2.x - https://phabricator.wikimedia.org/T162542#3166674 (10Paladox) Not sure why you closed it as declined.
[13:32:38] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:50:28] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[13:56:59] 06Operations, 10Phabricator, 07LDAP: Create a LDAP user for account Seanchen (Sean Chen) - https://phabricator.wikimedia.org/T162544#3166696 (10Paladox)
[14:01:38] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[14:19:18] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#2962020 (10H-stt) >>! In T156029#3053235, @BBlack wrote: > >>>! In T156029#3053179, @Gnom1 wrote: >> The goal is to //have Wikipedia's servers run on renewable energy//. It's as simple as that. > > I don't...
[14:28:48] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:44:38] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
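The "HTTP 5xx reqs/min" checks above report what fraction of recent Graphite datapoints sit above a critical value, here 11.11% of points above 1000 requests/min. A rough sketch of that kind of percent-above-threshold check against Graphite's render API is below; the metric target, window, and thresholds are placeholders rather than the real configuration.

```python
# Sketch: percentage-of-datapoints-above-threshold check against Graphite's
# render API. Metric target, window, and thresholds are placeholders.
import json
import urllib.request

URL = "http://graphite1001/render?target=reqstats.5xx&from=-10min&format=json"  # hypothetical target
WARN_PCT, CRIT_PCT, CRIT_VALUE = 5.0, 10.0, 1000.0

series = json.load(urllib.request.urlopen(URL))[0]["datapoints"]  # [[value, timestamp], ...]
values = [v for v, _ts in series if v is not None]
above = sum(1 for v in values if v > CRIT_VALUE)
pct = 100.0 * above / len(values) if values else 0.0

if pct >= CRIT_PCT:
    print("CRITICAL: %.2f%% of data above the critical threshold [%.1f]" % (pct, CRIT_VALUE))
elif pct >= WARN_PCT:
    print("WARNING: %.2f%% of data above the threshold" % pct)
else:
    print("OK: Less than %.2f%% above the threshold" % WARN_PCT)
```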
[14:53:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[14:54:48] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3063343 keys, up 16 days 22 hours - replication_delay is 0
[14:56:48] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[15:12:38] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[15:18:11] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Language-setup: nan and minnan subdomain redirects are a mess - https://phabricator.wikimedia.org/T86915#3166755 (10Liuxinyu970226) After re-clarification of @stevenj81, https://incubator.wikimedia.org/wiki/Wt/nan now shows "This project uses a different ISO code...
[16:22:28] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:40:28] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:50:28] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[17:01:18] PROBLEM - puppet last run on db1091 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:09:28] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[17:29:18] RECOVERY - puppet last run on db1091 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[18:23:38] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:40:42] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Language-setup: nan and minnan subdomain redirects are a mess - https://phabricator.wikimedia.org/T86915#3166940 (10StevenJ81) I didn't really intend what I wrote as a resolution; I intended it as a stopgap until such time as this issue is completely resolved. I...
[18:51:38] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[18:54:18] PROBLEM - puppet last run on aqs1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:03:18] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
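The rdb2005 alert above is simply a failed PING against the Redis instance on port 6479, and the recovery message echoes fields from a plain INFO read (version, key count, uptime, replication delay). A minimal sketch of such a health probe using redis-py, with the exact output format as an assumption:

```python
# Sketch: Redis health probe roughly matching the check output above.
# Host/port come from the alert; output wording is an approximation.
import sys
import redis

r = redis.StrictRedis(host="127.0.0.1", port=6479, socket_timeout=5)
try:
    r.ping()
except redis.RedisError:
    print("CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479")
    sys.exit(2)

info = r.info()
print("OK: REDIS %s on 127.0.0.1:6479, up %s seconds - replication_delay is %s" % (
    info["redis_version"], info["uptime_in_seconds"], info.get("master_last_io_seconds_ago", 0)))
sys.exit(0)
```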
[19:23:18] RECOVERY - puppet last run on aqs1007 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[19:29:27] (03PS1) 10Hoo man: Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103)
[19:30:47] (03CR) 10jerkins-bot: [V: 04-1] Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man)
[19:31:18] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[19:32:42] (03PS2) 10Hoo man: Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103)
[19:34:02] (03CR) 10jerkins-bot: [V: 04-1] Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man)
[19:39:18] 06Operations, 10Icinga: Update icinga to 2.x - https://phabricator.wikimedia.org/T162542#3166994 (10Paladox) 05declined>03Open Re opening as no reason for decline.
[19:39:34] (03PS3) 10Hoo man: Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103)
[20:40:18] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:48:02] !ops
[21:04:02] (03PS1) 10Phuedx: pagePreviews: Enable NavPopups gadget detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081)
[21:08:18] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[21:44:37] 06Operations, 10Traffic: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3057797 (10Volans) Since a couple of days both `einsteinium` and `tegmen` are spamming root@ every hour with certspotter errors, this time seems that the DigiCert service is responding 400 for the c...
[21:49:48] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:53:44] 06Operations, 10MediaWiki-extensions-PageAssessments: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#3167121 (10Volans) **Since Feb. 19th** we're getting one email every day from terbium with an error for each wiki (~900 lines email) with: ``` The following extensions are required to be ins...
[22:47:58] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[23:03:43] 06Operations, 10MediaWiki-extensions-PageAssessments: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#3167193 (10Peachey88) @Volans That would be {T159438} I believe
[23:15:28] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:26:28] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:32:56] 06Operations, 10Icinga: Update icinga to 2.x - https://phabricator.wikimedia.org/T162542#3167201 (10faidon) 05Open>03declined Because it will take ten times as long to explain why than what it took you to open this task. You casually talked about a complicated weeks- or months-long project for an "upgrade"...
[23:43:28] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[23:54:28] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[23:54:38] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: http status 500