[00:05:56] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:07:56] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.015 second response time
[00:09:56] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.010 second response time
[00:21:56] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.005 second response time
[00:25:56] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.006 second response time
[00:33:56] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[00:34:56] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:37:56] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.005 second response time
[00:44:56] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.007 second response time
[00:45:26] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:45:36] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:26] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[00:46:26] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[01:01:56] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[01:13:06] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:34:56] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1801.636919 Seconds
[01:35:56] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 40.745574 Seconds
[01:42:06] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[02:17:56] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:27:19] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 4 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2939249 (Liuxinyu970226)
[02:30:43] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 12m 03s)
[02:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:08] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jan 14 02:35:07 UTC 2017 (duration 4m 25s)
[02:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:56] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[03:15:36] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:17:26] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[04:46:06] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[04:54:06] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0]
[05:04:06] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0]
[05:04:56] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:12:06] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0]
[05:32:56] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:09:26] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 57 probes of 403 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:14:26] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 403 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:37:26] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:02:33] (PS1) Urbanecm: Namespace aliases on Bhojpuri Wikipedia (bhwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/332015 (https://phabricator.wikimedia.org/T155278)
[08:05:26] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:49:37] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 4 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2939601 (Revent) @brion Yes, the 'pending' queue is now empty, other than brief spikes (mainly due to spurts...
[08:57:06] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:07:46] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:08:36] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[09:15:56] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:25:06] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[09:37:26] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:43:56] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:54:06] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:59:16] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational
[10:05:26] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[10:22:06] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[10:25:16] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:50:06] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[10:52:06] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0]
[10:53:16] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[11:51:46] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:51:46] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:52:36] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[11:52:36] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[11:58:06] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[12:01:06] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[12:02:06] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0]
[14:03:56] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:08:18] (PS1) Urbanecm: Set wgBabelMainCategory for cswikiversity to Uživatel %code% [mediawiki-config] - https://gerrit.wikimedia.org/r/332046 (https://phabricator.wikimedia.org/T155301)
[14:21:56] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:31:36] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:32:56] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[14:50:56] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[14:59:36] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[15:08:31] Operations, Ops-Access-Requests: Request to access hadoop (stat1004) for Ladsgroup - https://phabricator.wikimedia.org/T155303#2939870 (Ladsgroup)
[15:28:16] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:56:16] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[16:09:36] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:32:26] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:38:36] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[17:01:26] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[17:09:16] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:12:50] (PS1) Urbanecm: Add *.leventhalmap.org to the copyupload whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/332053 (https://phabricator.wikimedia.org/T155309)
[17:36:32] Operations, MediaWiki-API, Traffic: Action API caching does not work when logged in - https://phabricator.wikimedia.org/T155314#2940086 (Tgr)
[17:37:16] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[18:33:44] (PS2) Tim Landscheidt: Tools: Remove redundant tools-db entry from /etc/hosts [puppet] - https://gerrit.wikimedia.org/r/328453 (https://phabricator.wikimedia.org/T139190)
[18:36:00] Operations, MediaWiki-API, Traffic: Action API caching does not work when logged in - https://phabricator.wikimedia.org/T155314#2940086 (Anomie) Duplicate of T97096 (I'm on my phone or I'd close it as such).
[18:47:47] Operations, MediaWiki-API, Traffic: Action API caching does not work when logged in - https://phabricator.wikimedia.org/T155314#2940216 (SamanthaNguyen)
[19:33:17] (CR) Tim Landscheidt: "I hadn't seen this patch, so I submitted an identical one as Iaf73d6f52ec402cb7c1b7eebd0bc462b55343825." [puppet] - https://gerrit.wikimedia.org/r/274566 (owner: Dduvall)
[19:46:26] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:04:09] (CR) Tim Landscheidt: puppetmaster: Enable expand_path for Hiera in Labs as well (1 comment) [puppet] - https://gerrit.wikimedia.org/r/329226 (owner: Tim Landscheidt)
[20:16:26] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[20:55:02] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940395 (Dereckson)
[21:03:08] Operations, MediaWiki-API, Traffic: Action API caching does not work when logged in - https://phabricator.wikimedia.org/T155314#2940457 (Tgr) duplicate>Open That task is about the MediaWiki API not emitting caching headers when `uselang=content` is not used. This is about Varnish marking such...
[21:05:18] (PS2) Tim Landscheidt: tools: Automount ldap.yaml too onto containers [puppet] - https://gerrit.wikimedia.org/r/327235 (owner: Yuvipanda)
[21:06:21] Operations, MediaWiki-API, Traffic: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#2940461 (Tgr)
[21:06:47] Operations, MediaWiki-API, Traffic: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#2940086 (Tgr)
[21:13:11] (CR) Tim Landscheidt: "Don't know if relevant: In OSM, the project page for bastion is not updated when users are added or removed from that project because SMW " [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: Andrew Bogott)
[21:22:05] (CR) Tim Landscheidt: [C: -1] labs nfsclient: Require /mnt/nfs's existence before trying to mount underneath it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/313034 (owner: Alex Monk)
[21:24:02] (CR) Tim Landscheidt: [C: 1] Add tools hiera common.yaml [labs/private] - https://gerrit.wikimedia.org/r/325041 (owner: Merlijn van Deen)
[21:38:26] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:39:26] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 15 failures. Last run 2 minutes ago with 15 failures. Failed resources (up to 3 shown): Service[salt-minion],Package[command-not-found-data],Package[os-prober],Package[python3-apport]
[22:07:16] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[22:07:26] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[22:38:16] (CR) Tim Landscheidt: "T133412 was resolved some time ago; is this patch still worth pursuing?" [puppet] - https://gerrit.wikimedia.org/r/296830 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[22:39:46] (CR) Tim Landscheidt: "Is this patch still current?" [software] - https://gerrit.wikimedia.org/r/233478 (owner: ArielGlenn)
[22:41:50] (CR) Tim Landscheidt: "Is this patch still needed?" [puppet] - https://gerrit.wikimedia.org/r/265847 (owner: Rush)
[22:46:06] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:47:43] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 4 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2940556 (matmarex) Is this resolved then?
[22:55:28] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 4 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2940569 (brion) Open>Resolved Yeah I think we're good to close this one out; improvements to the queu...
[23:07:58] (CR) Tim Landscheidt: "Is this patch still on the table?" [puppet] - https://gerrit.wikimedia.org/r/259008 (owner: Alexandros Kosiaris)
[23:13:06] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[23:31:06] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:52:48] (CR) Paladox: "recheck" [software] - https://gerrit.wikimedia.org/r/233478 (owner: ArielGlenn)
[23:52:56] (CR) jerkins-bot: [V: -1] retention: split setup.py into two files, for clients and master [software] - https://gerrit.wikimedia.org/r/233478 (owner: ArielGlenn)
[23:59:06] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures