[00:03:03] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#2663785 (Smalyshev) Open>Resolved
[00:03:06] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2663786 (Smalyshev)
[00:14:46] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[01:31:57] (PS5) 20after4: Make nginx optional in aptly class [puppet] - https://gerrit.wikimedia.org/r/312562
[01:35:10] (PS6) 20after4: Make nginx optional in aptly class [puppet] - https://gerrit.wikimedia.org/r/312562
[01:54:41] (PS7) 20after4: Make nginx optional in aptly class [puppet] - https://gerrit.wikimedia.org/r/312562
[02:39:02] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 16m 49s)
[02:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:44:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Sep 24 02:44:59 UTC 2016 (duration 5m 57s)
[02:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:51:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[02:56:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[03:32:25] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:03:49] (CR) 20after4: "Ok this works now. It's cherry-picked on beta - see https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Beta-Apt-Repo" [puppet] - https://gerrit.wikimedia.org/r/312562 (owner: 20after4)
[04:03:52] (CR) 20after4: "inline comments about what I changed and why" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/312562 (owner: 20after4)
[04:22:57] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:47:48] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:25:19] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:50:25] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:59:54] (CR) Brion VIBBER: "Correct." [mediawiki-config] - https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: Brion VIBBER)
[07:08:31] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:33:26] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:15:12] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1532
[10:15:42] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:12] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 851304 Threads: 2 Questions: 188864634 Slow queries: 4586 Opens: 6741 Flush tables: 2 Open tables: 530 Queries per second avg: 221.853 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1444
[10:41:27] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:41:27] PROBLEM - Disk space on thumbor1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=85%)
[10:45:34] <_joe_> ^^ this is thumbor not cleaning up its tmp dirs
[10:46:41] RECOVERY - Disk space on thumbor1001 is OK: DISK OK
[10:47:22] <_joe_> !log systemctl restart thumbor-instances.service on thumbor1001 freed 3 GB of space
[10:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:48:49] <_joe_> well, it actually freed much more
[11:20:44] _joe_: sigh, I'll take a look in the logs if there's an obvious reason
[11:31:20] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:56:24] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:21:36] !log apply temporary cleanup of old (+20m) thumbor temporary files - T146262
[12:21:37] T146262: Temp files not cleaned up on conversion error - https://phabricator.wikimedia.org/T146262
[12:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:28:35] godog, is that 20 million?
[12:28:53] Oh
[12:28:56] :P
[12:29:00] months?
[12:48:11] haha no minutes Bsadowski1
[12:48:33] Oh
[12:48:44] Ah, that it was completed in that amount of time?
[13:05:09] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[13:07:01] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:10:12] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[13:15:12] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[13:20:12] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[13:24:59] Bsadowski1: nope just the mtime of the file itself
[13:25:16] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 65 seconds ago with 0 failures
[13:30:55] Operations, Traffic, Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2664359 (Danielsberger) I finally got some simulation results. I'm comparing the following seven caching policies: |**Name of Policy**|**Description of Policy**| | LRU | Pure...
[13:32:38] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[13:39:43] Operations, Traffic, Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2664376 (Danielsberger) Happy to take input what to simulate next. Each simulation run takes about 48h. - 128 GB front mem cache? what's the actual size of cp4006? - disk cach...
[13:40:34] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:42:40] brb reboot
[14:01:56] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:05:49] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[14:24:44] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[15:00:34] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[15:05:48] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[17:40:38] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:06:17] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:20:35] Operations, RfC: #Varnish being used although archived - https://phabricator.wikimedia.org/T142244#2664688 (Aklapper) Open>declined Declining as per last two comments.
[19:30:10] !log hhvm 1283-1290 rolling restart
[19:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:41:40] Operations, CirrusSearch, Discovery, Discovery-Search (Current work): Resolve huge perf regression on autocomplete queries - https://phabricator.wikimedia.org/T146465#2664729 (EBernhardson) @ema ran an hhvm restart on the effected instances and p95's have dropped down to pre-problem numbers, slo...
[19:45:07] (CR) Hashar: "I have removed the patch from the beta cluster since I7383440288bd3394cd0660fec0f402f55009ce19 got merged" [puppet] - https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) (owner: Dduvall)
[19:45:21] (CR) Hashar: "I have dropped the patch from the beta cluster since I7383440288bd3394cd0660fec0f402f55009ce19 got merged" [puppet] - https://gerrit.wikimedia.org/r/310360 (https://phabricator.wikimedia.org/T138778) (owner: Dduvall)
[19:49:27] (PS1) Hashar: mariadb: fix class dependency on beta [puppet] - https://gerrit.wikimedia.org/r/312652
[21:58:03] (PS1) EBernhardson: Update mwdeploy group sudo rights for jessie [puppet] - https://gerrit.wikimedia.org/r/312705
[22:02:26] (CR) Hashar: "I eventually looked at the code. The FileSink can be dropped in favor of using subprocess.check_output() which captures STDOUT for you." (3 comments) [software/elasticsearch-tool] - https://gerrit.wikimedia.org/r/309573 (owner: Gehel)
[22:02:48] (PS2) EBernhardson: Update mwdeploy group sudo rights for jessie [puppet] - https://gerrit.wikimedia.org/r/312705
[22:04:07] ebernhardson: almost :]
[22:04:15] I thought Jessie had introduced /user/ !
[22:05:01] :P
[22:05:14] ebernhardson: feel free to cherry pick it on beta :D
[22:05:22] we no more have any precise
[22:05:37] (CR) Hashar: [C: 1] Update mwdeploy group sudo rights for jessie [puppet] - https://gerrit.wikimedia.org/r/312705 (owner: EBernhardson)
[22:07:34] sleep time! *wave*
[22:08:36] hashar: g'night. don't work on sunday ;)
[22:08:43] oh
[22:08:49] was just hobby reviewing some python code :D
[22:11:39] (CR) Alex Monk: "We have servers with trusty and we have servers with jessie - shouldn't we have sudo rules that allow us to work on both?" [puppet] - https://gerrit.wikimedia.org/r/312705 (owner: EBernhardson)
[22:13:05] PROBLEM - Disk space on scb1002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=87%)
[22:15:57] (CR) Paladox: "I would have thought this will work on both os." [puppet] - https://gerrit.wikimedia.org/r/312705 (owner: EBernhardson)
[22:49:16] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:16:34] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures