[01:01:01] hmm, are we having any issues atm? Mw.org has been craaaazy slow for some actions [01:03:27] No one has complained... Nothing obviously related from the monitoring [01:03:29] Jamesofur|cloud: what actions? [01:05:04] Checkuser was where I noticed it [01:05:32] like crazy slow, more than 30 seconds, a good 20 seconds of it without an obvious sign ANYTHING was happening [01:06:05] I was starting to look through the console to see if there were weird js errors when I pressed the go button before it finally started [01:06:17] other wikis don't seem to be having any issues with it [01:07:30] parsoid is hitting the api heavily again [01:07:34] hmm, now having issues on enWiki too, it seems that it could be hanging while doing calls from my global/user js [01:07:43] either ruWiki or enWiki depending on the time [01:08:03] (the calls are coming from there, because I load scripts from there) [01:08:36] yeah, that's what I suspect is happening [01:08:52] global, aggregate perf looks fine: http://performance.wikimedia.org/#!/day [01:09:24] but: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=API+application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [01:09:57] oh fun [01:12:36] probably unrelated, but on https://performance.wikimedia.org/#!/month what's that firstPaint change on the 19th? [01:13:34] Krenair: the lower numbers for that metric prior to the 19th were a bug; there's a task for it [01:13:41] ah, ok [01:17:13] Jamesofur|cloud: I gotta run and do groceries. If you know how, it'd be useful to get a waterfall view of a page load for your account. If you have Chrome, you can go to View -> Developer -> Developer tools, then click Timeline, then reload the page (if you reload the page with the timeline view open it will automatically record the page view) [01:17:26] then you can right-click on the resultant view and click "save timeline data..." [01:17:59] if you can capture that and email it to me, that would be very useful [01:18:51] * Jamesofur|cloud nods [01:18:53] will do [01:21:31] thanks. I sent an email to ops@ about parsoid / api [01:33:19] ori: [01:33:53] oops, so errr my kernel panicked when I tried to save the timeline... and now it's fine... 
so I'm guessing at least part of it was on my end [01:54:45] ori: if you look at the numbers, something else seems to hit the API much heavier than Parsoid [01:59:53] overall the traffic looks fairly normal: https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=API+application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [02:20:53] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 06m 16s) [02:21:10] Logged the message, Master [02:25:44] !log LocalisationUpdate completed (1.26wmf4) at 2015-05-10 02:24:40+00:00 [02:25:54] Logged the message, Master [02:41:29] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 05m 26s) [02:41:41] Logged the message, Master [02:45:51] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-10 02:44:48+00:00 [02:45:57] Logged the message, Master [03:30:20] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1274737 (10jeremyb) [03:34:47] PROBLEM - puppet last run on mw1053 is CRITICAL Puppet has 1 failures [03:34:56] PROBLEM - puppet last run on mw1163 is CRITICAL Puppet has 1 failures [03:36:07] PROBLEM - puppet last run on mw1029 is CRITICAL Puppet has 1 failures [03:40:57] PROBLEM - puppet last run on mw2182 is CRITICAL puppet fail [03:41:36] PROBLEM - puppet last run on mw2138 is CRITICAL puppet fail [03:42:47] PROBLEM - puppet last run on neon is CRITICAL puppet fail [03:46:58] PROBLEM - puppet last run on es1008 is CRITICAL puppet fail [03:48:57] PROBLEM - puppet last run on lvs1002 is CRITICAL Puppet has 1 failures [03:49:08] PROBLEM - puppet last run on elastic1007 is CRITICAL Puppet has 1 failures [03:56:46] RECOVERY - puppet last run on mw1029 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [03:56:57] RECOVERY - puppet last run on mw1053 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [03:57:06] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [04:00:06] RECOVERY - puppet last run on mw2182 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:00:46] RECOVERY - puppet last run on mw2138 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:01:57] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [04:02:57] RECOVERY - puppet last run on es1008 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [04:04:57] RECOVERY - puppet last run on lvs1002 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [04:05:08] RECOVERY - puppet last run on elastic1007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:10:20] (03PS1) 10Ori.livneh: exim: access template variable via '@' [puppet] - 10https://gerrit.wikimedia.org/r/209960 [04:42:42] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1274748 (10Tony_Tan_98) Information: Mozilla announced plan to deprecate Non-Secure HTTP: https://blog.mozilla.org/security/201... 
[04:42:58] (03PS1) 10Ori.livneh: interface: access template variable via '@' [puppet] - 10https://gerrit.wikimedia.org/r/209961 [05:17:13] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun May 10 05:16:10 UTC 2015 (duration 16m 9s) [05:17:18] Logged the message, Master [05:43:21] <_joe_> ori: oh, thanks for fixing that [05:43:51] <_joe_> I got once down to two warnings, but eventually gave up [05:43:59] <_joe_> when I faced nova [05:51:19] 6operations, 10Wikimedia-Interwiki-links: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915#1274759 (10TTO) 5Open>3Resolved nan: seems to work now. I'm not sure what was ever wrong with it; my September comment wasn't very specific. [05:51:20] 6operations, 10Wikimedia-Language-setup, 7Tracking: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) (tracking) - https://phabricator.wikimedia.org/T10217#1274761 (10TTO) [06:04:46] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [06:06:52] (03PS2) 10Ori.livneh: Update my gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/209214 (owner: 10Alex Monk) [06:07:07] (03CR) 10Ori.livneh: [C: 032] Update my gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/209214 (owner: 10Alex Monk) [06:07:32] (03CR) 10Ori.livneh: [V: 032] Update my gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/209214 (owner: 10Alex Monk) [06:12:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60587 bytes in 0.425 second response time [06:22:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [06:23:46] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[06:30:17] PROBLEM - puppet last run on logstash1006 is CRITICAL Puppet has 1 failures [06:33:47] PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 1 failures [06:34:27] PROBLEM - puppet last run on lvs2001 is CRITICAL Puppet has 1 failures [06:34:37] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:34:37] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures [06:34:37] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [06:34:37] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures [06:35:58] PROBLEM - puppet last run on mw2146 is CRITICAL Puppet has 1 failures [06:46:17] RECOVERY - puppet last run on logstash1006 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:46] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:47:27] RECOVERY - puppet last run on lvs2001 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:47:36] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:47:37] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:47:37] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:47:37] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:49:16] RECOVERY - puppet last run on mw2146 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:20:17] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [07:20:29] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [07:51:18] PROBLEM - puppet last run on cp3005 is CRITICAL puppet fail [08:10:47] RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:06:17] PROBLEM - puppet last run on eventlog1001 is CRITICAL Puppet has 29 failures [11:22:28] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:00:52] git is extremely slow. 
[12:07:58] (03PS1) 10Glaisher: Update nlwiki ContactPage recipient user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209967 [12:09:28] (03CR) 10Sjoerddebruin: [C: 031] Update nlwiki ContactPage recipient user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209967 (owner: 10Glaisher) [12:16:30] (03PS1) 10Merlijn van Deen: Add python-stdeb to -dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/209969 [12:17:12] (03CR) 10jenkins-bot: [V: 04-1] Add python-stdeb to -dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/209969 (owner: 10Merlijn van Deen) [12:22:01] (03CR) 10Merlijn van Deen: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/209969 (owner: 10Merlijn van Deen) [12:40:40] (03CR) 10Alex Monk: "See T98156" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209967 (owner: 10Glaisher) [15:17:37] PROBLEM - puppet last run on labvirt1008 is CRITICAL Puppet has 10 failures [15:22:23] (03PS1) 10Merlijn van Deen: Install flake8 (both python 2 and 3 versions) [puppet] - 10https://gerrit.wikimedia.org/r/209979 (https://phabricator.wikimedia.org/T90447) [15:33:56] RECOVERY - puppet last run on labvirt1008 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:49:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [16:03:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:10:37] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 24m atm and increasing - https://phabricator.wikimedia.org/T98621#1275032 (10Mlaffs) Now past 25.6m. I made edits to templates as far back as April 25th that haven't filtered through to the articles yet. [16:52:14] API request failed (internal_api_error_DBQueryError): [0b608d20] Database query error [16:52:20] what is this? [16:52:28] it has been popping up for a few days now... [16:52:56] Steinsplitter: Since it's a recurring issue, could you file a bug? [16:55:01] 6operations, 6Commons: internal_api_error_DBQueryError: Database query error - https://phabricator.wikimedia.org/T98706#1275045 (10Steinsplitter) 3NEW [16:55:40] 6operations, 6Commons: internal_api_error_DBQueryError: Database query error - https://phabricator.wikimedia.org/T98706#1275052 (10Steinsplitter) [16:56:31] 6operations, 6Commons, 7Wikimedia-log-errors: internal_api_error_DBQueryError: Database query error - https://phabricator.wikimedia.org/T98706#1275054 (10Glaisher) [16:56:56] 6operations, 6Commons, 7Wikimedia-log-errors: internal_api_error_DBQueryError: Database query error - https://phabricator.wikimedia.org/T98706#1275056 (10Steinsplitter) And sometimes file links remain blue; after purging by hand they become red. Not sure if this is related. [17:00:56] 6operations, 6Commons, 7Wikimedia-log-errors: internal_api_error_DBQueryError: Database query error - https://phabricator.wikimedia.org/T98706#1275072 (10hoo) Backtrace (page/ revision id removed): ``` 2015-05-10 16:52:01 mw1208 commonswiki exception INFO: [0b608d20] /w/api.php DBQueryError from line 127... [17:20:03] !log Inbound app server traffic more than doubled over the past 12 hrs: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [17:20:09] Logged the message, Master [17:37:10] hoo: is there some expensive wikidata job running? [17:37:27] Nothing except the standard stuff [17:37:50] That spike comes from commons, btw [17:38:41] how do you know? 
[17:39:13] The s4 databases have the same spike [17:39:16] so pretty likely [17:39:24] yes, was just noticing that [17:39:37] Looks like google is crawling heavily [17:39:41] From the running queries [17:41:27] Or maybe those just come in from other wikis [17:41:54] /wiki/User:Richenza/gallery [17:42:18] * ori nukes. [17:43:11] (03PS6) 10Southparkfan: Direct labsconsole.wm.o through Apache cluster [puppet] - 10https://gerrit.wikimedia.org/r/202788 (https://phabricator.wikimedia.org/T48554) [17:45:49] !log App server traffic coincides with spike on S4 dbs, lots of commons sleeper queries, fatal log contains many references to User:Richenza/gallery, so nuking. [17:45:56] Logged the message, Master [17:49:07] PROBLEM - RAID on ms-be2007 is CRITICAL 1 failed LD(s) (Offline) [17:49:30] (03PS3) 10Alex Monk: Direct labsconsole.wm.o through Apache cluster [dns] - 10https://gerrit.wikimedia.org/r/202791 (https://phabricator.wikimedia.org/T48554) (owner: 10Southparkfan) [17:49:36] (03PS7) 10Alex Monk: Direct labsconsole.wm.o through Apache cluster [puppet] - 10https://gerrit.wikimedia.org/r/202788 (https://phabricator.wikimedia.org/T48554) (owner: 10Southparkfan) [17:50:27] PROBLEM - Disk space on ms-be2007 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not accessible: Input/output error [17:54:28] PROBLEM - puppet last run on ms-be2007 is CRITICAL Puppet has 1 failures [18:01:57] RECOVERY - Disk space on ms-be2007 is OK: DISK OK [18:06:32] ori: Did you actually nuke that page? [18:07:36] hoo: yes [18:08:13] Ah, I see... awesome, MediaWiki is redirecting me to the main page [18:08:59] ori: you deleted this page? [18:10:03] Seems he removed the page row, yes [18:11:12] well... users are complaining about strange things happening on commons. no log entry etc. I don't want to cherry-pick... but at least a log entry should be created, or a user notification [18:12:10] Steinsplitter: yes, totally agree, I will let the user know on his/her talk page [18:12:17] fine :) [18:12:17] PROBLEM - puppet last run on mw2048 is CRITICAL puppet fail [18:15:40] ori, hmm... the right to delete pages should probably be part of the sysadmin group [18:15:52] and you should probably be in that group [18:16:37] https://meta.wikimedia.org/wiki/Special:GlobalUsers/sysadmin - I don't think JeLuF has server access anymore? [18:16:51] Krenair: No, he doesn't, but he's still active [18:16:57] You can ask him [18:17:02] probably in #wikidata right now [18:19:25] hi :) [18:20:21] * windowcat has two clients open...should probably pick one, meh [18:20:33] :) [18:21:07] https://tendril.wikimedia.org/host/view/db1064.eqiad.wmnet/3306 is very sad [18:21:17] all of them are [18:21:21] all s4, that is [18:21:34] confirmed that it's Googlebot crawling Djvu images [18:21:45] 13253450809 wikiuser 10.64.16.61:49810 commonswiki Query 0 Writing to net SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.114 */ img_metadata FROM `image` WHERE img_name = 'United_States_Statutes_at_Large_Volume_109_Part_1.djvu' AND img_timestamp = '20110629010313' LIMIT 1 0.000 [18:21:50] etc. etc. [18:21:56] yep [18:22:04] https://phabricator.wikimedia.org/T96360 is the bug [18:23:11] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1275096 (10aaron) >>! In T96360#1254487, @faidon wrote: >>>! In T96360#1253818, @aaron wrote: >> It probably... 
[18:30:06] AaronSchulz: getPageDimensions is loading the XML crap too [18:30:14] and is called from LocalFile.php [18:31:16] doTransform can still hit it via normaliseParams() => pageCount() btw [18:31:52] also cache misses from $wgMemc will still load the whole DB row, including img_metadata [18:31:57] RECOVERY - puppet last run on mw2048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:32:24] gwicke, do you have a call stack? [18:32:43] no, just grepped the code [18:33:06] PROBLEM - puppet last run on ms-be2001 is CRITICAL puppet fail [18:35:20] AaronSchulz: that call is in LocalFile.php [18:36:42] I have call stacks! [18:37:29] pageCount is a candidate too, it's potentially called from File.php [18:37:39] faidon@fluorine:/srv/xenon/logs/hourly$ grep loadExtraFromDB 2015-05-10_17.log | grep DjVu |wc -l [18:37:42] 304 [18:37:50] faidon@fluorine:/srv/xenon/logs/hourly$ grep loadExtraFromDB 2015-05-10_17.log | grep DjVu | grep -c getPageDimensions [18:37:53] 219 [18:38:33] "grep loadExtraFromDB 2015-05-10_17.log | grep DjVu | sort | uniq -c | sort -nr | head -10" is useful [18:38:52] http://p.defau.lt/?krgsuYV2naPicVa03GmTPA [18:39:58] * AaronSchulz didn't know we had those on fluorine [18:40:02] yup! [18:40:10] it's ori's "xenon" [18:40:30] ori set it up for flamegraph mostly [18:40:55] but they can be generally useful, although I admit it's the first time I'm using them for an outage :) [18:40:58] pretty cool [18:41:25] it just hit me [18:41:28] * AaronSchulz looks at grep -v Proofread | grep getMetaTree 2015-05-10_17.log [18:41:52] the majority is getPageDimensions, gwicke was right on his guess [18:42:07] logstash has a lot of info too [18:42:32] search for 'loadExtraFromDB' [18:42:41] gah, should be grep getMetaTree 2015-05-10_17.log | grep -v Proofread [18:43:16] almost all from enwikisource [18:43:43] not when Proofread is filtered out [18:43:50] which is not on commons [18:44:06] File::canRender;ImageHandler::canRender;LocalFile::getWidth;DjVuHandler::getPageDimensions;DjVuHandler::getMetaTree; [18:44:19] huh, a 3rd way to hit that :) [18:45:45] function canRender( $file ) { return ( $file->getWidth() && $file->getHeight() ); } [18:46:22] do we have lots of 0xY or Yx0 djvu/pdfs? ;) [18:47:41] this is slowly becoming this ground-up refactor that you were recommending [18:51:01] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:51:38] there's another path from ImagePage.php line 332 calls LocalFile->getWidth() [18:53:04] 6operations: tendril.wikimedia.org attempts to load external resources (fonts from google) - https://phabricator.wikimedia.org/T98710#1275117 (10Krenair) 3NEW [18:54:05] 6operations, 6WMF-Legal, 10Wikimedia-General-or-Unknown: dbtree loads third party resources - https://phabricator.wikimedia.org/T96499#1275125 (10Krenair) See also T98710 [18:54:22] too hard to avoid calling those methods...hack for now would be to make the handler cache pageCount/width,height in $wgMemc, keyed by sha1 [18:54:34] gwicke, seems sane? [18:57:36] AaronSchulz: sounds good to me [18:58:41] not sure it'll help much with crawling, as there's probably a very long tail of images [19:00:53] gwicke, most are not that huge though, right [19:01:26] paravoid, is there a huge set of different images causing problems? 
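The greps above are ad-hoc aggregation of the sampled call stacks under /srv/xenon/logs/hourly. A minimal sketch of the same aggregation, assuming one semicolon-separated stack per line as in the frames pasted above; the log path and target frame name are taken from this discussion, everything else is illustrative:

```
#!/usr/bin/env python
# Rough sketch: count which callers lead into DjVuHandler::getMetaTree in a
# xenon hourly log, assuming one semicolon-separated stack trace per line
# (as in the samples pasted above). Same spirit as the
# grep | sort | uniq -c | sort -nr pipeline used here.
from collections import Counter

LOG = "/srv/xenon/logs/hourly/2015-05-10_17.log"  # path taken from the discussion
TARGET = "DjVuHandler::getMetaTree"

callers = Counter()
with open(LOG) as f:
    for line in f:
        frames = [fr for fr in line.strip().split(";") if fr]
        if TARGET in frames:
            i = frames.index(TARGET)
            callers[frames[i - 1] if i > 0 else "(root)"] += 1

for caller, count in callers.most_common(10):
    print(count, caller)
```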
[19:01:45] not sure if it's huge, it's surely a big diverse set [19:01:47] wikisource probably [19:02:04] logstash shows a fairly wide distribution for the 'url' field [19:02:07] I guess one could query by img_metadata size on a research slave or something [19:03:29] gwicke, if it used the objectcache table it could be manually populated and won't fall out of cache [19:03:43] of course that's really what img_metadata was for... [19:04:14] as a longer-term solution separate width / height columns might be better [19:05:14] So what is driving this increase in wikisource requests? Is that google spidering wikisource? [19:07:16] SkinTemplate::buildContentNavigationUrls;Hooks::run;ProofreadPage::onSkinTemplateNavigation;LocalFile::getWidth [19:07:21] logstash is showing most requests coming from requests for pages to http://en.wikisource.org/wiki/Page:Statesman's_Year-Book_1913.djvu/1385 and similar [19:07:47] the extension does width calls and such a lot [19:07:53] ProofreadPagePage.php line 191 calls LocalFile->getWidth() [19:10:03] the graph for 'getMetaTree' matches in logstash pretty much matches the db traffic [19:11:08] AaronSchulz: isn't the width a per-page thing? [19:13:39] gwicke, yes it can vary per page [19:14:27] any cache can iterate over $tree->BODY[0]->OBJECT to set the page => w,h map [19:15:51] some of those books seem to be huge [19:15:54] /wiki/Page:United_States_Statutes_at_Large_Volume_121.djvu/2194 [19:17:36] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [19:18:19] 2819 pages: http://en.wikisource.org/wiki/Index:United_States_Statutes_at_Large_Volume_121.djvu [19:18:41] gwicke: yes, it's Googleobt [19:18:43] *bot [19:33:35] gwicke, so, another option to consider is using poolcounter in getMetaTree if the file size is large [19:33:56] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:34:55] AaronSchulz: the distribution of urls doesn't show terribly hot ones [19:35:45] gwicke, it might need to use CACHE_DB with prepopulation [19:35:52] to me caching or storing a summary of the dimensions (and page count) for the entire file sounds most promising [19:36:33] a json array with w/h for each page [19:40:33] ideally the text would go in ES or swift and img_metadata would include dimensions and such [19:40:59] the whole field is assumed to be doing that by File/FileRepo, not storing 2mb of text [19:42:11] load seems more reasonable again [19:42:33] paravoid, was that you blocking something or did it just slow down? [19:42:41] I haven't done anything [19:44:07] AaronSchulz: a standard schema for img_metadata could be nice too, avoids unnecessary code duplication and establishes a policy of what goes in the light-weight summary & what should go elsewhere [19:50:48] gwicke, https://gerrit.wikimedia.org/r/#/c/209984/1 [19:51:24] I'm surprised, I thought that was done already... [19:51:40] AaronSchulz: will that enable the pool counter? [19:53:26] yes, FileRenderExpensive is already set [19:53:34] Mh... don't we scale 4k videos? 
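The fix being floated above — cache the page count and a per-page width/height map keyed by the file's sha1 (or prepopulate it in the objectcache table), so the multi-megabyte DjVu XML in img_metadata only has to be parsed once per file version — would live in the PHP DjVu handler. What follows is only a language-agnostic sketch of that idea; `cache` and `parse_metadata_tree` are hypothetical stand-ins for $wgMemc / CACHE_DB and DjVuHandler::getMetaTree, not real APIs:

```
# Sketch of the proposed caching scheme, not the actual MediaWiki code:
# parse the big metadata tree at most once per file version and keep a small
# {page: (width, height)} map plus the page count, keyed by the file's sha1.
# `cache` and `parse_metadata_tree` are hypothetical stand-ins.

def get_page_dimensions(file_sha1, page, cache, parse_metadata_tree):
    key = "djvu-dimensions:" + file_sha1
    summary = cache.get(key)
    if summary is None:
        tree = parse_metadata_tree(file_sha1)  # the expensive img_metadata parse
        summary = {
            "count": len(tree.pages),
            "dims": {i: (p.width, p.height) for i, p in enumerate(tree.pages, 1)},
        }
        cache.set(key, summary)  # long-lived per sha1, could be prepopulated
    return summary["dims"].get(page)
```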
[19:53:38] nm, it's a separate pool [19:53:45] $poolCounterType = 'FileRenderExpensive'; [19:53:58] hoo, in *theory* those use range requests :) [19:54:11] (03PS5) 10Yuvipanda: tools: silence sudo security e-mails [puppet] - 10https://gerrit.wikimedia.org/r/203876 (https://phabricator.wikimedia.org/T95882) (owner: 10Merlijn van Deen) [19:54:15] though you never quite know with this code [19:54:35] for some reason this was only set for tiff files so far [19:54:45] https://commons.wikimedia.org/w/index.php?title=File:Big_Buck_Bunny_4K.webm&action=purge is stupidly slow [19:54:50] might time out [19:55:00] ... 503, yay [19:57:53] (03CR) 10Yuvipanda: [C: 032] Install flake8 (both python 2 and 3 versions) [puppet] - 10https://gerrit.wikimedia.org/r/209979 (https://phabricator.wikimedia.org/T90447) (owner: 10Merlijn van Deen) [19:58:01] hoo, yeah oggthumb only works for oggs [19:58:28] webm probably uses ffmpeg, which does not support giving it a URL to use Range on [19:58:51] (03PS6) 10Yuvipanda: tools: silence sudo security e-mails [puppet] - 10https://gerrit.wikimedia.org/r/203876 (https://phabricator.wikimedia.org/T95882) (owner: 10Merlijn van Deen) [19:59:10] no doubt that should do something with isExpensiveToThumbnail ... [19:59:51] Ok, so using that inside an article doesn't make sense... most people wont be able to watch 4k videos, especially webm (given the missing hw accel on most platforms) [20:00:56] (03CR) 10Yuvipanda: [C: 032] tools: silence sudo security e-mails [puppet] - 10https://gerrit.wikimedia.org/r/203876 (https://phabricator.wikimedia.org/T95882) (owner: 10Merlijn van Deen) [20:01:15] (03CR) 10Yuvipanda: "Sigh, forgot to add module prefix (tools) to commit message" [puppet] - 10https://gerrit.wikimedia.org/r/209979 (https://phabricator.wikimedia.org/T90447) (owner: 10Merlijn van Deen) [20:02:03] hoo|away: 4k would be great if we had chunking and adaptive streaming [20:02:54] https://en.wikipedia.org/wiki/Dynamic_Adaptive_Streaming_over_HTTP [20:03:02] And support for ALL browsers... [20:04:25] so, assuming https://gerrit.wikimedia.org/r/#/c/209986/ can be deployed if anything happens again, I am going afk [20:10:29] (03PS1) 10Yuvipanda: tools: Make python3-flake8 trusty only [puppet] - 10https://gerrit.wikimedia.org/r/209988 [20:10:42] valhallasw: ^ [20:11:27] yuvipanda: it's more complicated, unfortunately [20:11:31] let me fix [20:11:35] valhallasw: flake8? 
[20:11:45] yuvipanda: it's python-flake8 on trusty [20:11:49] but pyflakes on precise [20:11:59] lol [20:12:00] haha [20:13:25] (03PS2) 10Yuvipanda: tools: Make python3-flake8 trusty only [puppet] - 10https://gerrit.wikimedia.org/r/209988 [20:13:26] valhallasw: ^ updated [20:14:01] (03PS3) 10Merlijn van Deen: tools: Fix flake packages in trusty vs precise [puppet] - 10https://gerrit.wikimedia.org/r/209988 (owner: 10Yuvipanda) [20:14:01] err [20:14:02] yuvipanda: ^ updated, you forgot python-flake8 in trusty ;-) [20:14:04] wrong commit message [20:14:16] and fix commit message, as well [20:14:30] valhallasw: thanks [20:16:04] (03CR) 10Yuvipanda: [C: 032] tools: Fix flake packages in trusty vs precise [puppet] - 10https://gerrit.wikimedia.org/r/209988 (owner: 10Yuvipanda) [20:20:40] (03PS1) 10Merlijn van Deen: tools: Fix trusty vs precise packages [puppet] - 10https://gerrit.wikimedia.org/r/209990 (https://phabricator.wikimedia.org/T97628) [20:29:16] valhallasw: ah, sweet [20:29:34] (03CR) 10Yuvipanda: [C: 032] tools: Fix trusty vs precise packages [puppet] - 10https://gerrit.wikimedia.org/r/209990 (https://phabricator.wikimedia.org/T97628) (owner: 10Merlijn van Deen) [20:30:28] valhallasw: I wrote https://github.com/yuvipanda/personal-wiki/blob/master/tools-dsh-generator.py the other day [20:31:08] yuvipanda: oh, neat [20:32:42] valhallasw: yeah, let me get rid of the ubuntu key notices [20:32:49] ? [20:33:25] Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu]: Not removing directory; use 'force' to override [20:33:26] and friends [20:33:38] ah [20:36:19] valhallasw: much better than messing around with salt [20:36:32] valhallasw: I was going to do this or setup a local salt master. this seems much better [20:39:06] yuvipanda: to do what, exactly? [20:39:19] why would we need to ssh to all hosts in practice? [20:39:19] valhallasw: ‘run command X on multiple hosts’? [20:39:27] puppet? [20:39:31] valhallasw: usually when I want to force a puppet run [20:39:34] ahh [20:39:40] or for one off tasks like this (removing ubuntu keys) [20:39:56] the ‘real’ fix is to fix the underlying image but that won’t help for hosts that already exist so [20:41:27] fair enough [20:51:43] fabric! 
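The "run command X on multiple hosts" pattern being weighed here against salt and fabric is small enough to sketch. This is a minimal illustration only: the host list and command are placeholders, and it assumes plain ssh access of the kind the dsh generator linked above is meant to produce:

```
#!/usr/bin/env python
# Minimal sketch of a dsh-style runner: ssh to each host in a list and run
# one command (e.g. a forced puppet run), printing a per-host result.
# The host list and command are placeholders, not taken from production.
import subprocess

HOSTS = ["tools-exec-01", "tools-exec-02"]  # placeholder host list
COMMAND = "sudo puppet agent --test"        # example one-off task

for host in HOSTS:
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, COMMAND],
        capture_output=True, text=True, timeout=300,
    )
    status = "OK" if result.returncode == 0 else "FAIL (%d)" % result.returncode
    print("%s: %s" % (host, status))
```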
[21:11:47] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [21:16:27] PROBLEM - puppet last run on db2012 is CRITICAL puppet fail [21:16:46] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [21:23:17] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [21:24:57] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [21:34:18] RECOVERY - puppet last run on db2012 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:46:06] PROBLEM - High load average on labstore1001 is CRITICAL 57.14% of data above the critical threshold [24.0] [21:49:17] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [21:57:27] PROBLEM - puppet last run on cp4009 is CRITICAL puppet fail [22:13:26] PROBLEM - puppet last run on cp3031 is CRITICAL puppet fail [22:15:17] RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:20:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 35.71% of data above the critical threshold [500.0] [22:22:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [22:30:37] PROBLEM - puppet last run on db2036 is CRITICAL puppet fail [22:31:08] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:36:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:40:20] (03PS1) 10Yuvipanda: ssh: Make hba enable-able via hiera [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) [22:40:25] (03CR) 10jenkins-bot: [V: 04-1] ssh: Make hba enable-able via hiera [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [22:41:51] (03PS2) 10Yuvipanda: ssh: Make hba enable-able via hiera [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) [22:46:47] RECOVERY - puppet last run on db2036 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [22:57:07] PROBLEM - DPKG on mw1017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:58:27] mw1017 is me [23:02:07] RECOVERY - DPKG on mw1017 is OK: All packages OK [23:08:32] varnent: Just to be safe I'm going to switch the banners to turn off 15 minutes early, so that no one clicks it at the last minute and gets locked out automatically from voting [23:08:37] when securepoll turns off [23:08:48] PROBLEM - DPKG on mw1021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:13:37] RECOVERY - DPKG on mw1021 is OK: All packages OK [23:21:56] PROBLEM - DPKG on mw1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:07] RECOVERY - DPKG on mw1020 is OK: All packages OK [23:46:28] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected