[00:07:50] Reedy or James_F around? I'd like to be marked as a helper in #wikipedia-userscripts please.
[00:09:01] T13|mobile: I don't think I have access, sorry.
[00:09:54] James_F: -ChanServ-: James_F +AVefiorstv (Admin) [modified 6y 9w 3d ago]
[00:10:03] Chanserv says you do.
[00:10:07] Wow. Really? Ha.
[00:11:26] However, I'm certainly not active there.
[00:11:33] I'd be happy with being in the helper group, but wouldn't oppose being an admin since the channel is almost always empty.
[00:11:33] * James_F defers to Reedy.
[00:12:00] Yeah, no-one's active there, that's why I'm here. Lol
[00:13:30] !log restarted elasticsearch on logstash1001 & logstash1003; OOM
[00:13:37] Logged the message, Master
[00:13:40] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 62 threshold =0.1% breach: {status: yellow, number_of_nodes: 3, unassigned_shards: 58, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 64, initializing_shards: 4, number_of_data_nodes: 3}
[00:15:20] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 32 threshold =0.1% breach: {status: yellow, number_of_nodes: 3, unassigned_shards: 28, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 94, initializing_shards: 4, number_of_data_nodes: 3}
[00:15:49] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 32 threshold =0.1% breach: {status: yellow, number_of_nodes: 3, unassigned_shards: 28, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 94, initializing_shards: 4, number_of_data_nodes: 3}
[00:18:22] T13|mobile: is that channel different from
#mediawiki-scripts?
[00:18:59] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 124, initializing_shards: 2, number_of_data_nodes: 3
[00:19:29] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 124, initializing_shards: 2, number_of_data_nodes: 3
[00:19:49] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 124, initializing_shards: 2, number_of_data_nodes: 3
[00:27:00] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: puppet fail
[00:45:57] Wikimedia-Logstash, operations, ops-core: Upgrade RAM for logstash100[123] to 64G - https://phabricator.wikimedia.org/T87078#983418 (bd808) NEW
[00:46:19] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[00:48:54] operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#983426 (bd808) If the RAM from these is compatible with the logstash100[123] boxes we have then maybe {T87078} can be done by just moving some sticks from one to another?
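Editor's aside: the ElasticSearch alerts above report an "inactive shards" count and a 0.1% threshold. The sketch below is a hypothetical reconstruction of that ratio from the fields shown in the CRITICAL output (`unassigned_shards`, `initializing_shards`, `active_shards`, `relocating_shards`); the real Icinga plugin's exact formula may differ.

```python
# Hypothetical reconstruction of the "inactive shards" ratio from the
# Icinga ElasticSearch check output above. Field names are taken from
# the CRITICAL message in the log; the real plugin may compute this
# slightly differently.
def inactive_fraction(status):
    """Fraction of shards that are not active (unassigned or initializing)."""
    inactive = status["unassigned_shards"] + status["initializing_shards"]
    total = inactive + status["active_shards"] + status["relocating_shards"]
    return inactive / total

# Numbers from the 00:13:40 logstash1003 alert: 58 unassigned +
# 4 initializing = 62 inactive shards, matching the alert text.
status = {"unassigned_shards": 58, "initializing_shards": 4,
          "active_shards": 64, "relocating_shards": 0}
print(round(inactive_fraction(status), 3))  # about 0.492, far over 0.1%
```

With roughly half the shards inactive, any threshold expressed as a fraction of a percent trips immediately, which is consistent with the flood of CRITICALs after the restart.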
[00:57:12] Wikimedia-Logstash, operations, ops-core: Upgrade RAM for logstash100[123] to 64G - https://phabricator.wikimedia.org/T87078#983430 (ori)
[00:58:15] greg-g: I'm changing "set 0" to "group 0" in https://wikitech.wikimedia.org/wiki/Deployments/One_week to match other pages
[01:00:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[01:31:59] PROBLEM - dhclient process on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:31:59] PROBLEM - salt-minion processes on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:10] PROBLEM - configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:19] PROBLEM - SSH on rhenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:32:19] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:29] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:40] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:49] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:46:50] PROBLEM - NTP on rhenium is CRITICAL: NTP CRITICAL: No response from NTP server
[02:01:28] (PS1) Ori.livneh: xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593
[02:02:16] AaronSchulz: ^
[02:03:02] (PS3) Ori.livneh: admin: update my deployment script [puppet] - https://gerrit.wikimedia.org/r/185374
[02:03:08] (CR) Ori.livneh: [C: 2 V: 2] admin: update my deployment script [puppet] - https://gerrit.wikimedia.org/r/185374 (owner: Ori.livneh)
[02:05:49] PROBLEM - Ori committing changes on the weekend on palladium is CRITICAL: CRITICAL: Ori committed a change on a weekend
[02:05:51] ori or anyone: are requests to testwiki never cached?
It seems so, response is always X-Cache: cp1054 miss (0), cp4017 miss (0), cp4017 frontend miss (0)
[02:08:15] (PS2) Ori.livneh: xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593
[02:08:20] spagewmf: yes, by design
[02:08:42] makes debugging PHP bugs simpler.
[02:09:11] (PS3) Ori.livneh: xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593
[02:09:41] (CR) Aaron Schulz: [C: 1] xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593 (owner: Ori.livneh)
[02:09:51] (CR) Ori.livneh: [C: 2] xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593 (owner: Ori.livneh)
[02:09:56] (Merged) jenkins-bot: xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593 (owner: Ori.livneh)
[02:10:20] ori: thanks, I'm updating some wikitech pages with the X-Wikimedia-Debug: 1 trick.
[02:10:27] !log ori Synchronized wmf-config/StartProfiler.php: I4e3871d3d: xenon: Annotate file scope and closure scope with filename (duration: 00m 05s)
[02:10:40] Logged the message, Master
[02:10:44] spagewmf: ooooh! Thanks a ton. The X-Wikimedia-Debug trick actually lets you bypass the cache for other wikis, too.
[02:11:31] yup, I've been `curl -H 'X-Wikimedia-Debug: 1' --dump-header - https://it.wikiquote.org/wiki/Pagina_principale`. Reply to your fine post coming
[02:12:35] spagewmf: I also realized I forgot to explain how to install a Chrome / Chromium extension from GitHub. You need to clone it locally and then tell Chrome to 'Load unpacked extension...'
[02:12:45] I could just add it to the Chrome Web Store, that would probably make things easier.
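Editor's aside: the cache-bypass trick discussed above (sending an `X-Wikimedia-Debug: 1` request header, as in the `curl` example from the log) can also be sketched in Python. This is a minimal illustration only; no request is actually sent here, and the URL is just the one quoted in the conversation.

```python
import urllib.request

# Build (but do not send) a request carrying the X-Wikimedia-Debug
# header, which per the discussion above makes the Varnish caches pass
# the request through to the application servers instead of serving a
# cached copy.
req = urllib.request.Request(
    "https://it.wikiquote.org/wiki/Pagina_principale",
    headers={"X-Wikimedia-Debug": "1"},
)

# urllib stores header names capitalized internally, so look the header
# up the same way it is stored:
print(req.get_header("X-wikimedia-debug"))  # prints "1"
```

Sending it for real (e.g. via `urllib.request.urlopen(req)`) and inspecting the `X-Cache` response header should then show misses at every cache layer, matching the `cp1054 miss (0), ...` output quoted above.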
[02:14:17] do I need some special right beyond cluster access to be able to log in to the job queue hosts?
[02:14:32] !log krinkle Synchronized php-1.25wmf15/resources/src/mediawiki/mediawiki.content.json.css: Ic1d10393912fcefa22d (duration: 00m 06s)
[02:14:38] Logged the message, Master
[02:14:40] PROBLEM - puppet last run on pc1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:14:44] tgr: not sure. have you tried?
[02:14:52] !log krinkle Synchronized php-1.25wmf15/includes/content/JsonContent.php: Ic1d10393912fcefa22d (duration: 00m 05s)
[02:14:55] Logged the message, Master
[02:15:16] ori: yes, the normal SSH key does not seem to work
[02:15:39] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:16:10] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:16:12] !log krinkle Synchronized php-1.25wmf14/resources/src/mediawiki/mediawiki.content.json.css: Ic1d10393912fcefa22d (duration: 00m 06s)
[02:16:13] yeah, it seems like the new puppetization does not include access rights for wikidev. not sure if that was by design. you'd want to check with _joe_. in the interim, is there anything i can do to help you?
[02:16:16] Logged the message, Master
[02:16:20] i'll check mw1148
[02:16:23] !log krinkle Synchronized php-1.25wmf14/includes/content/JsonContent.php: Ic1d10393912fcefa22d (duration: 00m 06s)
[02:16:28] Logged the message, Master
[02:16:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:17:20] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 64978 bytes in 0.389 second response time
[02:17:25] mw1148: threads stuck in __lll_lock_wait(); restarted HHVM.
[02:17:29] errrrr
[02:17:30] ori: I'm trying to debug https://phabricator.wikimedia.org/T87040
[02:17:31] !log mw1148: threads stuck in __lll_lock_wait(); restarted HHVM.
[02:17:34] Logged the message, Master
[02:17:50] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.068 second response time
[02:17:56] could you run the query in https://wikitech.wikimedia.org/wiki/Job_queue#Examining_the_data_for_a_job for gwt?
[02:18:20] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 01s)
[02:18:23] Logged the message, Master
[02:18:24] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-17 02:18:24+00:00
[02:18:28] Logged the message, Master
[02:18:51] tgr: you can do that yourself without sshing to the host; just run 'redis-cli -h rdb1001'
[02:18:53] on tin
[02:20:19] ori: thanks! https://wikitech.wikimedia.org/wiki/Redis is a bit outdated then, I'll update it
[02:20:32] cool. thank you!
[02:28:49] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:30:53] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 01s)
[02:30:57] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-17 02:30:57+00:00
[02:30:59] Logged the message, Master
[02:31:03] Logged the message, Master
[02:32:40] RECOVERY - puppet last run on pc1003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:41:57] ori: is there a way to find out the redis password without going to the redis host and looking up the config file?
[02:42:13] I thought I could do something like "hiera 'passwords::redis::main_password'" on tin but I can't get it to work
[03:04:19] RECOVERY - Ori committing changes on the weekend on palladium is OK: OK: Ori is behaving himself
[04:20:13] (PS1) Tim Landscheidt: Add IP mapping for toolserver.org [puppet] - https://gerrit.wikimedia.org/r/185604 (https://phabricator.wikimedia.org/T87086)
[04:21:00] (CR) Tim Landscheidt: "Based on assumption that toolserver.org = relic."
[puppet] - https://gerrit.wikimedia.org/r/185604 (https://phabricator.wikimedia.org/T87086) (owner: Tim Landscheidt)
[04:40:19] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[04:44:48] (PS1) Yuvipanda: beta: Add shinken monitoring for parsoid [puppet] - https://gerrit.wikimedia.org/r/185606 (https://phabricator.wikimedia.org/T87063)
[04:44:55] (CR) jenkins-bot: [V: -1] beta: Add shinken monitoring for parsoid [puppet] - https://gerrit.wikimedia.org/r/185606 (https://phabricator.wikimedia.org/T87063) (owner: Yuvipanda)
[04:45:29] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[04:45:44] (PS2) Yuvipanda: beta: Add shinken monitoring for parsoid [puppet] - https://gerrit.wikimedia.org/r/185606 (https://phabricator.wikimedia.org/T87063)
[04:47:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jan 17 04:47:42 UTC 2015 (duration 47m 41s)
[04:47:52] Logged the message, Master
[05:13:24] (PS1) Yuvipanda: parsoid: Open port 8000 with ferm rule [puppet] - https://gerrit.wikimedia.org/r/185607 (https://phabricator.wikimedia.org/T86951)
[05:13:32] (CR) jenkins-bot: [V: -1] parsoid: Open port 8000 with ferm rule [puppet] - https://gerrit.wikimedia.org/r/185607 (https://phabricator.wikimedia.org/T86951) (owner: Yuvipanda)
[05:13:59] (PS2) Yuvipanda: parsoid: Open port 8000 with ferm rule [puppet] - https://gerrit.wikimedia.org/r/185607 (https://phabricator.wikimedia.org/T86951)
[05:14:06] (PS2) Yuvipanda: beta: Remove dup of /home/mwdeploy/.ssh [puppet] - https://gerrit.wikimedia.org/r/185570 (owner: BryanDavis)
[05:14:15] (CR) Yuvipanda: [C: 2] beta: Add shinken monitoring for parsoid [puppet] - https://gerrit.wikimedia.org/r/185606 (https://phabricator.wikimedia.org/T87063) (owner: Yuvipanda)
[05:14:35] (CR) Yuvipanda: [C: 2] parsoid: Open port 8000 with
ferm rule [puppet] - https://gerrit.wikimedia.org/r/185607 (https://phabricator.wikimedia.org/T86951) (owner: Yuvipanda)
[05:16:14] (PS3) Yuvipanda: beta: Remove dup of /home/mwdeploy/.ssh [puppet] - https://gerrit.wikimedia.org/r/185570 (owner: BryanDavis)
[05:17:38] (CR) Yuvipanda: [C: 2] beta: Remove dup of /home/mwdeploy/.ssh [puppet] - https://gerrit.wikimedia.org/r/185570 (owner: BryanDavis)
[05:38:00] (PS1) Yuvipanda: parsoid: Include base::firewall on parsoid hosts [puppet] - https://gerrit.wikimedia.org/r/185610
[05:38:06] (CR) jenkins-bot: [V: -1] parsoid: Include base::firewall on parsoid hosts [puppet] - https://gerrit.wikimedia.org/r/185610 (owner: Yuvipanda)
[05:38:18] (PS2) Yuvipanda: parsoid: Include base::firewall on parsoid hosts [puppet] - https://gerrit.wikimedia.org/r/185610
[05:39:30] akosiaris: ^
[05:52:09] (PS1) Yuvipanda: beta: Monitor cxserver, mathoid, citoid and apertium services [puppet] - https://gerrit.wikimedia.org/r/185613 (https://phabricator.wikimedia.org/T87087)
[05:52:58] (CR) Yuvipanda: [C: 2] beta: Monitor cxserver, mathoid, citoid and apertium services [puppet] - https://gerrit.wikimedia.org/r/185613 (https://phabricator.wikimedia.org/T87087) (owner: Yuvipanda)
[05:59:22] (PS2) Glaisher: Redirect wikibook(s).(org|com) to www.wikibooks.org [puppet] - https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039)
[06:02:20] (CR) Glaisher: "Updated now, this is a better solution, imo." [puppet] - https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: Glaisher)
[06:05:10] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[06:05:11] (PS2) Yuvipanda: Add IP mapping for toolserver.org [puppet] - https://gerrit.wikimedia.org/r/185604 (https://phabricator.wikimedia.org/T87086) (owner: Tim Landscheidt)
[06:05:41] (CR) Yuvipanda: [C: 2] "Ideally at some point in the future we'd automate this, but until then..." [puppet] - https://gerrit.wikimedia.org/r/185604 (https://phabricator.wikimedia.org/T87086) (owner: Tim Landscheidt)
[06:07:39] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[06:28:29] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:10] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:39] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:59] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:30] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:24] hmm
[06:31:34] these all seem to fail *and* /var/log/puppet.log is empy
[06:31:37] *empty
[06:36:23] !log restarted dnsmasq on labnet1001 (see https://wikitech.wikimedia.org/wiki/Labs_DNS#DHCP_and_internal_DNS for how to)
[06:36:29] Logged the message, Master
[06:44:49] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:10] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on db2040 is
OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:49:44] (PS1) Yuvipanda: deployment: Open up redis port on deployment masters [puppet] - https://gerrit.wikimedia.org/r/185614
[06:49:45] YuviPanda: maybe it happens whenever logrotate rotates a log file while a puppet run is in progress
[06:49:53] ori: yeah, that’s what I was thinking as well
[06:50:07] didn’t check though, got distracted
[06:50:17] it has been an issue for months
[06:50:38] perhaps we can make logrotate abort if puppet is running atm
[06:50:51] with a pid check
[06:51:02] how many servers do we have in total?
[06:51:19] good question.
[06:51:22] I don’t actually know
[06:52:08] I guess I can find out by looking at the salt certificates list
[06:52:28] 819
[06:52:39] and puppet runs every 20 minutes, right?
[06:52:52] and stays running, for, say, a minute on average
[06:53:27] (yeah, 819)
[06:53:50] (on that note, we record puppet running time on labs, perhaps not a bad idea to do so on prod too)
[06:54:00] so at any given moment 40 hosts are running puppet, right?
[06:54:28] that doesn't match up with the number of hosts that were affected (7)
[06:54:43] but maybe the conditions are more specific
[06:54:56] or maybe it’s less than a minute now, because apt-get is outside.
[06:55:27] i expect the log files are useful (if not puppet, then dmesg)
[06:55:41] I checked them. no failures reported
[06:55:46] on puppet that is
[06:55:46] but anyways, whereabouts are you at? still in chennai?
[06:55:49] didn’t see dmesg
[06:55:59] ori: I am!
I fly out tonight, will be in SF sunday afternoon
[06:56:32] oh man, i hate long flights
[06:57:12] the worst is when you can't sleep and you watch the movie on your neighbor's display without any sound
[06:57:17] ori: I’ve been asked to stop drinking as well now, so my usual coping mechanism isn’t going to work. *And* I can’t use a computer on flight either...
[06:57:30] ori: YUP. Terrible. and all my neighbors seem to have terrible taset
[06:57:30] and you just see all the formulas and cliches
[06:57:31] *taste
[06:57:36] yeah.
[06:57:45] I have a kindle now, though
[06:57:59] and am reading a BIND book. Might add TLDP’s LDAP one to it and read it during the flight too
[06:58:44] (re: formulas, https://www.youtube.com/watch?v=0zYYCCsSjkw)
[06:59:05] * YuviPanda is in a coffee place with terrible 3G, will watch when he gets home
[06:59:28] just a silly comedic video :)
[06:59:41] ori: have you read / come across http://www.cs.virginia.edu/~robins/YouAndYourResearch.html before
[06:59:59] nope! is it worth reading?
[07:00:07] ori: yeah, I think so.
[07:00:15] nice, i'll read it.
[07:00:43] ori: there’s a video of the talk as well https://www.youtube.com/watch?v=a1zDuOPkMSw
[07:01:05] it’s interesting, and has some rather good points I think. I would definitely be interested in seeing what you think about it, ori :)
[07:01:40] I think it should be called ‘You and Your Work’. It specifically talks about research, but I think the points are generalizable to any work you care about.
[07:34:49] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail
[07:52:49] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:14:14] <_joe_> good morning/evening
[08:14:38] * YuviPanda waves at _joe_
[08:15:11] <_joe_> YuviPanda: which way are you flying to the us?
[08:15:30] _joe_: frnakfurt
[08:15:32] *frankfurt
[08:15:41] <_joe_> oh, like me
[08:16:00] _joe_: yeah.
although that makes the flight 20+ hours long
[08:16:05] instead of flying through hong kong
[08:16:33] <_joe_> yeah I thought the east route maybe shorter
[08:17:43] _joe_: yeah, 19.5h (via HK) vs close to 24h now
[08:46:25] (PS2) Yuvipanda: deployment: Open up redis port on deployment masters [puppet] - https://gerrit.wikimedia.org/r/185614
[09:19:49] (PS1) Yuvipanda: mediawiki: Move admin class inclusion from role to site.pp [puppet] - https://gerrit.wikimedia.org/r/185621
[09:21:27] (PS2) Yuvipanda: mediawiki: Move admin class inclusion from role to site.pp [puppet] - https://gerrit.wikimedia.org/r/185621
[09:23:04] (PS1) Yuvipanda: mediawiki: Don't include ::admin in labs [puppet] - https://gerrit.wikimedia.org/r/185622
[09:23:34] (Abandoned) Yuvipanda: mediawiki: Move admin class inclusion from role to site.pp [puppet] - https://gerrit.wikimedia.org/r/185621 (owner: Yuvipanda)
[09:25:54] (CR) Giuseppe Lavagetto: [C: 1] "I will remove the inclusion and the realm guard as soon as possible anyways."
[puppet] - https://gerrit.wikimedia.org/r/185622 (owner: Yuvipanda)
[09:29:04] (PS2) Yuvipanda: mediawiki: Don't include ::admin in labs [puppet] - https://gerrit.wikimedia.org/r/185622
[09:29:29] (PS3) Yuvipanda: mediawiki: Don't include ::admin in labs [puppet] - https://gerrit.wikimedia.org/r/185622
[09:31:11] (CR) Yuvipanda: [C: 2] mediawiki: Don't include ::admin in labs [puppet] - https://gerrit.wikimedia.org/r/185622 (owner: Yuvipanda)
[09:33:40] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail
[09:39:31] (PS1) Yuvipanda: ocg: Don't include admin class in labs [puppet] - https://gerrit.wikimedia.org/r/185625
[09:40:09] (PS2) Yuvipanda: ocg: Don't include admin class in labs [puppet] - https://gerrit.wikimedia.org/r/185625
[09:40:56] (CR) Yuvipanda: [C: 2] ocg: Don't include admin class in labs [puppet] - https://gerrit.wikimedia.org/r/185625 (owner: Yuvipanda)
[09:52:03] (PS3) Yuvipanda: deployment: Open up redis port on deployment masters [puppet] - https://gerrit.wikimedia.org/r/185614
[09:54:40] (CR) Yuvipanda: [C: 2] deployment: Open up redis port on deployment masters [puppet] - https://gerrit.wikimedia.org/r/185614 (owner: Yuvipanda)
[10:07:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[10:27:50] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[17:53:33] spagewmf: thanks!
[18:40:39] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:58:39] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[19:20:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0]
[19:23:49] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:23:49] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 4 failures
[19:24:00] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:24:59] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:38:10] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[19:38:10] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[19:38:29] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:41:59] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:42:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[20:17:59] Hi
[20:18:26] Is this the right place to make a request for a rather specific project?
[20:19:36] Namely, would it be possible to set up an OSM-like stack on Wikimedia or Labs toolserver that I could use to make a 'free' (or reasonably free) map of the world's rail networks?
[21:04:39] maps in labs: #wikimedia-labs, but you should remember that these maps should be for Wikimedia projects and that labs performance is... imperfect for an OSM stack (I checked). maps in the main cluster: there are such plans, expect more news in the next 10-15 years
[21:04:50] Qcoder00, ^
[21:05:06] OK
[23:34:23] anybody able to diagnose the problem with the RC feed on Commons?
[23:35:06] what is the problem exactly?
[23:35:14] it appears to be only showing GWToolset Log entries
[23:35:50] and broken log entries too, ew.
[23:36:35] Krenair: https://phabricator.wikimedia.org/T87040
[23:36:50] haven't had the time to look into it yet
[23:37:13] it's a single job, can be killed if the log spam is a problem in itself
[23:38:26] yeah, that's flooding RC
[23:38:36] Special:NewFiles is still working, but I don't think any of the patrollers are able to do much on Commons right now because of the RC flooding.
[23:38:39] from two users
[23:39:18] tgr, so those are duplicate logs for the same upload event?
[23:39:29] or are they all distinct events that should be marked as bot?
[23:40:45] wow, I just opened #commons.wikimedia.
[23:41:08] should probably kill it
[23:44:51] tgr, ori: did one of you kill it?
[23:44:53] looks like no
[23:45:33] give me a sec to figure out how that's done
[23:50:18] ori, do you know?
[23:51:04] ori: do you think anything will explode if I just delete the job table entry?
[23:51:21] there doesn't seem to be any documentation around about killing jobs
[23:51:30] I doubt that would stop a job that's already executing
[23:52:56] delete the gwtoolset job list in redis then?
[23:53:29] the job is not exactly running, it is continuously rescheduling itself somehow, I think
[23:53:50] tgr, so those are duplicate logs for the same upload event?
[23:53:50] or are they all distinct events that should be marked as bot?
[23:54:06] it's a job for a single small upload being run multiple times?
[23:54:24] it's not an upload event, as far as I understand, this is GWT trying to fetch a metadata file from some GLAM server
[23:54:53] and there should be limits and throttling and whatnot to it, but clearly that's borked
[23:55:52] https://github.com/wikimedia/mediawiki-extensions-GWToolset/blob/master/includes/Jobs/UploadMetadataJob.php
[23:56:46] the actual uploading is done by UploadMediafileJob from what I remember, a bunch of instances of that job would be created by UploadMetadataJob but it's not even getting to that point
[23:57:25] ew it recreates itself?
[23:58:24] yup
[23:58:44] it's an exponential backoff type thing, I think
[23:59:02] still it's not clear why it would recreate itself, then go and perform the log anyway
[23:59:22] which theoretically should only happen when there are too many mediafile jobs, and there are 0 now
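Editor's aside: the "exponential backoff type thing" described above (a job that reschedules itself with growing delays) can be sketched generically as follows. This is an illustration of the pattern only, not the actual GWToolset UploadMetadataJob code; the base delay, cap, and attempt count are invented parameters.

```python
# Generic exponential-backoff retry schedule: each reschedule waits
# twice as long as the previous one, up to a cap. Illustrative only;
# the real GWToolset/JobQueue logic differs (and, per the log, was
# misfiring: it kept rescheduling even with zero pending mediafile jobs).
def backoff_delays(base=1, cap=3600, attempts=8):
    """Return the delay in seconds before each retry: base * 2**n, capped."""
    return [min(base * 2 ** n, cap) for n in range(attempts)]

print(backoff_delays())  # [1, 2, 4, 8, 16, 32, 64, 128]
```

A correct implementation would also stop rescheduling once its retry condition clears, which is exactly the check that appears to have been broken here.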