[01:23:36] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:52:39] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[02:20:31] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 17s)
[02:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun May 14 02:26:33 UTC 2017 (duration 6m 2s)
[02:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:33:09] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[04:01:09] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[04:08:19] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5768.60 Read Requests/Sec=1942.70 Write Requests/Sec=0.80 KBytes Read/Sec=33883.20 KBytes_Written/Sec=20.80
[04:17:19] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=29.60 Read Requests/Sec=0.40 Write Requests/Sec=0.40 KBytes Read/Sec=2.80 KBytes_Written/Sec=13.20
[05:22:49] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[05:23:49] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[07:33:49] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:22:49] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[09:13:19] PROBLEM - SSH on ms-be1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:14:09] RECOVERY - SSH on ms-be1019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[09:25:49] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[09:25:59] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:25:59] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:27:49] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.138 second response time
[09:29:50] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 4.271 second response time
[11:11:49] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[11:48:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:48:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:56:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:58:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:03:02] (CR) Hashar: [C: 1] Jenkins: install jdk, not just jre [puppet] - https://gerrit.wikimedia.org/r/348961 (owner: Chad)
[12:03:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:04:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:04:57] (CR) Hashar: [C: 1] "I guess that is to prepare the migration to role/profile/module scheme? Should be a noop on contint1001 / contint2001 so feel free to de" [puppet] - https://gerrit.wikimedia.org/r/353357 (owner: Dzahn)
[12:11:49] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[13:31:44] (Draft1) Paladox: Install openjdk jdk version instead of jre [debs/gerrit] - https://gerrit.wikimedia.org/r/353765
[13:31:46] (PS2) Paladox: Install openjdk jdk version instead of jre [debs/gerrit] - https://gerrit.wikimedia.org/r/353765
[13:33:27] (PS4) Paladox: Test: DO NOT MERGE [debs/gerrit] - https://gerrit.wikimedia.org/r/350440
[14:04:45] (Draft1) Paladox: Fix debian-rules-missing-recommended-target [debs/gerrit] - https://gerrit.wikimedia.org/r/353766
[14:04:47] (PS2) Paladox: Fix debian-rules-missing-recommended-target [debs/gerrit] - https://gerrit.wikimedia.org/r/353766
[14:09:17] (PS3) Paladox: Fix debian-rules-missing-recommended-target [debs/gerrit] - https://gerrit.wikimedia.org/r/353766
[14:11:57] (PS4) Paladox: Fix debian-rules-missing-recommended-target [debs/gerrit] - https://gerrit.wikimedia.org/r/353766
[14:20:47] (PS5) Paladox: Test: DO NOT MERGE [debs/gerrit] - https://gerrit.wikimedia.org/r/350440
[14:24:21] (PS5) Paladox: Fix debian-rules-missing-recommended-target [debs/gerrit] - https://gerrit.wikimedia.org/r/353766
[14:34:19] (PS6) Paladox: Fix debian-rules-missing-recommended-target [debs/gerrit] - https://gerrit.wikimedia.org/r/353766
[15:14:50] Operations, Commons, Datasets-Archiving, Datasets-General-or-Unknown, Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229#3261065 (Hydriz)
[16:12:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[16:15:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[16:15:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[16:17:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[16:29:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[16:32:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:33:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:33:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:33:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:33:42] hi, since yesterday, I have a JS loading issue on Commons, I restarted my PC twice, and it didn't change anything
[19:34:16] I have to purge a page 2 times for the JS to load
[20:33:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:36:39] RECOVERY - MegaRAID on labstore1003 is OK: OK: optimal, 5 logical, 34 physical
[20:41:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:08:38] yannf: one moment, let me see your personal JS
[21:12:35] yannf: I think I can make some improvements, permission to edit?
[21:13:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[21:13:40] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[21:15:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[21:15:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[21:23:13] sjoerddebruin, sure, please
[21:25:03] yannf: alright, done. I've changed all scripts to the preferred mw.loader.load and wrapped it in a "mw.loader.using" so they will load correctly.
[21:25:06] You'll probably experience less page shifting as well now.
[21:29:58] thanks
[21:30:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:31:14] Let me know if this improves your situation. It did help for me. :)
[21:45:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:49:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:50:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:50:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:51:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:01:49] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:05:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[22:05:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[22:05:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[22:06:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[22:14:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:14:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:14:42] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:16:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:28:49] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:29:49] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[22:46:49] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:46:49] PROBLEM - MD RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:46:59] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
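The user-script change described at 21:25:03 (switching to mw.loader.load and wrapping the calls in mw.loader.using) might look roughly like the sketch below. The helper name `loadUserScripts`, the dependency list, and the example.org URLs are hypothetical and are not taken from the actual edit; `mw` is passed in as a parameter only so the sketch can run outside a wiki page.

```javascript
// Hypothetical helper (not from the log): load external user scripts only
// after the ResourceLoader modules they depend on are available.
// On a real wiki page you would use the global `mw` object instead of a parameter.
function loadUserScripts(mw, dependencies, scriptUrls) {
    // mw.loader.using() returns a promise that resolves once the named
    // modules have been loaded.
    return mw.loader.using(dependencies).then(function () {
        scriptUrls.forEach(function (url) {
            // mw.loader.load() accepts a URL plus a MIME type and
            // injects the script into the page.
            mw.loader.load(url, 'text/javascript');
        });
        return scriptUrls.length;
    });
}

// Usage on a wiki page (assumed module name and placeholder URLs):
// loadUserScripts(mw, ['mediawiki.util'], [
//     'https://example.org/gadget-a.js',
//     'https://example.org/gadget-b.js'
// ]);
```

Deferring the loads until the dependencies resolve is one plausible reason the scripts would stop failing on first page view, though the log does not say which dependencies were involved.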
[22:47:39] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1019 is OK: OK ferm input default policy is set
[22:47:49] RECOVERY - MD RAID on ms-be1019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[22:47:49] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures
[22:52:16] Request from (my ip) via cp1053 cp1053, Varnish XID 41240874
[22:52:17] Error: 503, Backend fetch failed at Sun, 14 May 2017 22:51:50 GMT
[22:53:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[22:54:14] https://status.wikimedia.org/ shows service disruptions too, so i guess/hope ops are aware
[22:54:43] ok
[22:55:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[22:55:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:55:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[22:56:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[22:57:19] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[22:57:49] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[22:58:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[23:00:19] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:05:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:05:19] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:05:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:07:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:09:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:42:19] PROBLEM - SSH on ms-be1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:42:49] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:42:59] PROBLEM - MD RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:42:59] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:43:09] RECOVERY - SSH on ms-be1019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[23:43:39] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1019 is OK: OK ferm input default policy is set
[23:43:49] RECOVERY - MD RAID on ms-be1019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[23:43:49] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures