[00:08:14] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:08:25] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:08:34] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:24] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72864 bytes in 0.132 second response time [00:09:24] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.041 second response time [00:10:04] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time [00:14:04] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:14:34] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:34] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:24] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72824 bytes in 0.139 second response time [00:15:24] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [00:19:04] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 41602 MB (3% inode=99%) [00:20:34] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:20:34] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:21:15] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:16] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.029 second response time [00:35:34] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 617 bytes in 0.050 second response time [00:35:34] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72824 bytes in 0.144 second response time [01:02:14] (03PS1) 10Reedy: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396714 (https://phabricator.wikimedia.org/T182506) [01:04:04] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:17:34] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:17:45] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:17:45] (03PS11) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) [01:17:54] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:18:09] (03PS25) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [01:19:25] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.035 second response time [01:19:44] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.080 second response time [01:19:44] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72864 bytes in 0.150 second response time [02:16:55] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:17:04] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:17:44] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:04] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72825 bytes in 7.983 second response time [02:22:34] RECOVERY - Apache HTTP 
on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.060 second response time [02:22:54] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.042 second response time [02:41:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [02:46:04] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:13:35] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 41247 MB (3% inode=99%) [03:19:04] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:19:14] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:19:44] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:21:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [03:24:55] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 796.83 seconds [03:34:05] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.829 second response time [03:34:15] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72856 bytes in 0.140 second response time [03:34:35] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.047 second response time [03:41:04] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [03:43:10] (03PS1) 10Catrope: Give rcOresDamagingPref 
the same default as oresDamagingPref [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396805 (https://phabricator.wikimedia.org/T182354) [03:47:13] (03PS1) 10Catrope: Revert "Disable ORES in fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396806 (https://phabricator.wikimedia.org/T182354) [03:47:26] (03PS2) 10Catrope: Revert "Disable ORES in fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396806 (https://phabricator.wikimedia.org/T182354) [03:57:04] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 236.18 seconds [04:05:44] (03CR) 10EddieGP: [C: 04-1] "This isn't doing everything that's needed. It's especially not fulfilling the bullet points" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: 10MarcoAurelio) [04:12:54] (03CR) 10EddieGP: [C: 04-1] "The als part LGTM though." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [04:16:24] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:35] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:17:54] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:19:54] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 4.306 second response time [04:20:14] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.061 second response time [04:20:34] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72856 bytes in 0.148 second response time [04:25:44] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:54] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:26:24] PROBLEM - 
Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:34] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72842 bytes in 0.098 second response time [04:30:45] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.051 second response time [04:31:15] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.075 second response time [05:17:55] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:18:14] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:18:35] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:18:54] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.029 second response time [05:19:05] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.024 second response time [05:19:25] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 72842 bytes in 0.098 second response time [05:22:04] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:22:34] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:22:54] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:24] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.033 second response time [05:23:35] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:44] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72844 bytes in 0.157 second response time [05:23:55] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 616 bytes in 0.077 second response time [05:24:04] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:24:15] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:43:04] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:17:44] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:17:55] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:18:14] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:18:34] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.039 second response time [06:18:54] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72938 bytes in 0.118 second response time [06:19:04] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.040 second response time [06:19:55] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.023 second response time [06:20:24] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 72936 bytes in 0.094 second response time [06:20:34] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time [06:43:04] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:10:45] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:42:24] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] 
https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [07:44:25] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:44:34] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:44:55] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:00:45] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [08:35:25] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [08:50:45] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [08:51:44] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [09:05:34] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 72798 bytes in 0.339 second response time [09:05:54] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.078 second response time [09:06:05] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.048 second response time [09:19:44] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:45] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:14] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:54:30] (03CR) 10MarcoAurelio: apache: redirect several wikis per Board of Trustees and LangCom request (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [09:56:49] 
(03CR) 10MarcoAurelio: "I can remove wikiversions.json from this patch for now and add it inmediatelly before SWAT so there are no merge conflicts. As for CommonS" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: 10MarcoAurelio) [10:15:31] (03PS1) 10ArielGlenn: fix up cron job for listing last n good dumps [puppet] - 10https://gerrit.wikimedia.org/r/396912 [10:16:13] (03CR) 10ArielGlenn: [C: 032] fix up cron job for listing last n good dumps [puppet] - 10https://gerrit.wikimedia.org/r/396912 (owner: 10ArielGlenn) [10:17:04] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:17:15] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:17:45] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:18:05] PROBLEM - Nginx local proxy to apache on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:18:34] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:18:35] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:18:45] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3826196 (10MarcoAurelio) I hate this script: ``` /modules/mediawiki/files/apache/sites/redirects (review/marcoaurelio/T169450) $... 
[10:23:25] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 72796 bytes in 0.101 second response time [10:23:34] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time [10:24:04] RECOVERY - Nginx local proxy to apache on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.046 second response time [10:29:44] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.057 second response time [10:30:04] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 72796 bytes in 0.089 second response time [10:30:24] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.031 second response time [10:31:23] (03PS8) 10MarcoAurelio: apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [10:32:10] (03PS9) 10MarcoAurelio: apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [10:43:37] (03PS7) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [10:46:20] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3826244 (10MarcoAurelio) Well. The redirects apache config is done. Can we deploy this now (next puppet swat window) or do we have... 
[10:48:30] (03CR) 10ArielGlenn: [C: 032] move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [11:00:37] (03PS1) 10ArielGlenn: add the deprecated dumps user to the labstore boxes temporarily [puppet] - 10https://gerrit.wikimedia.org/r/396914 [11:03:43] (03CR) 10ArielGlenn: [C: 032] add the deprecated dumps user to the labstore boxes temporarily [puppet] - 10https://gerrit.wikimedia.org/r/396914 (owner: 10ArielGlenn) [11:04:24] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:17:14] PROBLEM - Nginx local proxy to apache on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:17:44] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:17:44] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:34] (03CR) 10Nikerabbit: [C: 04-1] Extension:Translate default permissions for Wikimedia wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793) (owner: 10MarcoAurelio) [11:34:34] (03PS1) 10ArielGlenn: make all incoming rsyncs of datasets to datasets servers run as dumpsgen [puppet] - 10https://gerrit.wikimedia.org/r/396915 (https://phabricator.wikimedia.org/T113467) [11:39:56] (03PS2) 10ArielGlenn: make all incoming rsyncs of datasets to datasets servers run as dumpsgen [puppet] - 10https://gerrit.wikimedia.org/r/396915 (https://phabricator.wikimedia.org/T113467) [11:55:54] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.067 second response time [11:55:54] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 72838 bytes in 0.151 second response time [11:56:24] RECOVERY - Nginx local proxy to apache on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 
617 bytes in 0.067 second response time [12:03:45] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.037 second response time [12:04:04] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.042 second response time [12:04:34] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 72837 bytes in 0.104 second response time [12:11:41] (03CR) 10EddieGP: [C: 031] "LGTM and thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [12:16:44] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:55] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:07] (03PS3) 10Zoranzoki21: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [12:17:09] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3826287 (10EddieGP) >>! In T169450#3826244, @MarcoAurelio wrote: > Well. The redirects apache config is done. Can we deploy this n... [12:17:14] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:27] (03CR) 10Zoranzoki21: [C: 031] "Should be ok now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [12:18:02] (03CR) 10Zoranzoki21: [C: 031] Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396714 (https://phabricator.wikimedia.org/T182506) (owner: 10Reedy) [12:22:39] (03PS4) 10Revi: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) [12:24:07] (03CR) 10Revi: "Task is not about removing the NS aliases." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [12:38:34] PROBLEM - Nginx local proxy to apache on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:04] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:05] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:24] RECOVERY - Nginx local proxy to apache on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time [12:40:30] (03CR) 10Zoranzoki21: "> Task is not about removing the NS aliases." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [12:40:54] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time [12:40:55] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 72798 bytes in 0.172 second response time [12:48:43] (03CR) 10Revi: "> > Task is not about removing the NS aliases." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [12:54:37] (03CR) 10Luke081515: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [13:21:05] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:14] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:34] PROBLEM - Nginx local proxy to apache on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:04] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.083 second response time [13:24:04] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 72870 bytes in 0.370 second response time [13:24:26] RECOVERY - Nginx local proxy to apache on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.071 second response time [14:22:14] PROBLEM - Apache HTTP on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [14:23:14] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.073 second response time [14:28:12] (03CR) 10ArielGlenn: [C: 032] make all incoming rsyncs of datasets to datasets servers run as dumpsgen [puppet] - 10https://gerrit.wikimedia.org/r/396915 (https://phabricator.wikimedia.org/T113467) (owner: 10ArielGlenn) [15:00:42] (03PS10) 10Zoranzoki21: apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [15:00:42] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [15:19:20] (03PS1) 10ArielGlenn: 
convert all incoming rsyncs, all cron jobs on dumps webservers to dumpsgen user [puppet] - 10https://gerrit.wikimedia.org/r/396919 (https://phabricator.wikimedia.org/T113467) [15:19:54] (03CR) 10jerkins-bot: [V: 04-1] convert all incoming rsyncs, all cron jobs on dumps webservers to dumpsgen user [puppet] - 10https://gerrit.wikimedia.org/r/396919 (https://phabricator.wikimedia.org/T113467) (owner: 10ArielGlenn) [15:24:49] (03PS2) 10ArielGlenn: convert all incoming rsyncs, all cron jobs on dumps webservers to dumpsgen user [puppet] - 10https://gerrit.wikimedia.org/r/396919 (https://phabricator.wikimedia.org/T113467) [15:44:39] (03PS3) 10ArielGlenn: convert all incoming rsyncs, all cron jobs on dumps webservers to dumpsgen user [puppet] - 10https://gerrit.wikimedia.org/r/396919 (https://phabricator.wikimedia.org/T113467) [16:00:24] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [16:21:54] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.041 second response time [16:22:44] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time [16:22:44] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 72798 bytes in 0.459 second response time [16:25:18] (03CR) 10ArielGlenn: [C: 032] convert all incoming rsyncs, all cron jobs on dumps webservers to dumpsgen user [puppet] - 10https://gerrit.wikimedia.org/r/396919 (https://phabricator.wikimedia.org/T113467) (owner: 10ArielGlenn) [16:25:24] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:58:21] (03PS1) 10ArielGlenn: remove the last vestiges of the datasets user [puppet] - 10https://gerrit.wikimedia.org/r/396928 (https://phabricator.wikimedia.org/T113467) [17:15:59] 10Operations, 10Dumps-Generation, 10Patch-For-Review: fix up datasets uid - 
https://phabricator.wikimedia.org/T113467#3826541 (10ArielGlenn) @hoo, heads up: all jobs on the snapshot hosts including your wikidata cron jobs run as the dumpsgen user. You already have sudo privs on the snapshots as that user. D... [17:59:24] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:16:44] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:45] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:25] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:56:45] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 72265 bytes in 6.549 second response time [19:56:54] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time [19:57:34] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.030 second response time [20:01:01] !log ran kafka preferred-replica-election for the kafka analytics cluster (1012->1022) to re-add kafka1012 to the kafka brokers acting as partition leaders (will spread the load in a better way) [20:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:08] (03PS1) 10ArielGlenn: get rid of datasets1001 mount on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/396931 (https://phabricator.wikimedia.org/T182540) [20:02:43] (03CR) 10jerkins-bot: [V: 04-1] get rid of datasets1001 mount on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/396931 (https://phabricator.wikimedia.org/T182540) (owner: 10ArielGlenn) [20:06:01] (03PS2) 10ArielGlenn: get rid of datasets1001 mount on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/396931 (https://phabricator.wikimedia.org/T182540) [20:13:05] apergos: just seen the msg sorry! 
I'd have tried a strace but no other idea for the moment [20:13:27] yeah it's fine elukey [20:13:34] I restarted it long ago anyways [20:13:59] I figured in case you happened to be around in the next few minutes, but... sunday, unlikely [20:16:22] seems to be api-appservers related afaics [20:17:31] (03PS3) 10ArielGlenn: get rid of datasets1001 mount on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/396931 (https://phabricator.wikimedia.org/T182540) [20:17:44] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:54] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:54] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [20:18:04] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:00] kafka1012 is ok, just rebalanced the load.. 
kafka1018 is still down [20:20:34] PROBLEM - kartotherian endpoints health on maps-test2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) timed out before a response was received [20:20:34] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.027 second response time [20:20:44] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 72264 bytes in 0.122 second response time [20:20:55] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.046 second response time [20:21:24] RECOVERY - kartotherian endpoints health on maps-test2002 is OK: All endpoints are healthy [20:22:54] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4029 is CRITICAL: connect to address 10.128.0.129 and port 3128: Connection refused [20:23:54] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4029 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time [20:33:53] !log execute restart-hhvm on mw1312 - hhvm stuck multiple times queueing requests [20:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:10] (03CR) 10ArielGlenn: [C: 04-2] "This cannot be merged until the current wikidata weekly finishes up, as it's running on the old filesystem. It's doing rdf truthy-nt dump" [puppet] - 10https://gerrit.wikimedia.org/r/396931 (https://phabricator.wikimedia.org/T182540) (owner: 10ArielGlenn) [20:34:42] (03CR) 10ArielGlenn: [C: 04-2] "This cannot be merged until the current wikidata weekly cron job finishes up, as it is running as the datasets user." [puppet] - 10https://gerrit.wikimedia.org/r/396928 (https://phabricator.wikimedia.org/T113467) (owner: 10ArielGlenn) [20:44:01] going off again, we'll see tomorrow if more api appservers will show the issue [20:44:17] Have a good one elukey [20:44:25] you too! 
:) [22:20:15] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:25] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:44] PROBLEM - HHVM rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds