[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151029T0000). [00:01:27] (03PS2) 10Dzahn: etc,redis,dynamicproxy: fix some lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/249658 [00:01:46] (03PS1) 10Dzahn: pxe/dhcp: switch host radium to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/249660 (https://phabricator.wikimedia.org/T116963) [00:12:29] (03PS2) 10Dzahn: pxe/dhcp: switch host radium to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/249660 (https://phabricator.wikimedia.org/T116963) [00:12:38] (03CR) 10Dzahn: [C: 032] pxe/dhcp: switch host radium to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/249660 (https://phabricator.wikimedia.org/T116963) (owner: 10Dzahn) [00:20:11] (03PS1) 10Dzahn: servermon,ganglia,gerrit: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249661 [00:21:10] (03PS2) 10Dzahn: servermon,ganglia,gerrit: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249661 [00:21:57] (03CR) 10Dzahn: [C: 032] "trying to be not too small and not too large, these are all one-line fixes just spread out across all roles and modules" [puppet] - 10https://gerrit.wikimedia.org/r/249661 (owner: 10Dzahn) [00:32:50] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /home 41022 MB (3% inode=99%) [00:34:46] ^ where the 3% are still 39G :p [00:35:04] checks for users with largest homes [00:37:25] finds one user with about 400G /home [00:37:46] and several with around 50 or 60G each [00:41:43] * Platonides misread it as "a user with 400Gbit at home" [00:42:13] was wondering about its provider :) [00:43:57] Platonides: 400 https://en.wikipedia.org/wiki/Google_Fiber plans :P [00:46:16] xD [00:46:33] 6operations, 6Analytics-Engineering: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1764505 (10Dzahn) 3NEW [00:47:17] 6operations, 
6Analytics-Engineering: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1764517 (10Dzahn) /dev/mapper/tank-home 1008G 979G 30G 98% /home 398G ellery 73G nuria 64G west1 58G ironholds 54G mforns 48G ezachte 47G spetrea 46G jamesur 43G milimetric 40G madhuvishy 3... [00:47:41] it seems somebody is still writing [00:48:27] that's probably the article recommendation stuff [00:48:43] its gonna run out soon this way [00:48:46] down to 2% [00:48:55] i see more than one active user.. hrmm [00:49:07] yeah let me see if I can get ellery's attention [00:49:15] cool, thanks! [00:49:16] mutante: is the 400G user the one with most activity? [00:49:31] ellery and nuria and ezachte.. i see in top [00:49:48] ezachte is running the normal perl scripts, so i think not that [00:49:52] the other 2 are java [00:50:23] (03CR) 10Alex Monk: "https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers update was skipped, I've added it now" [puppet] - 10https://gerrit.wikimedia.org/r/242187 (https://phabricator.wikimedia.org/T106142) (owner: 10Andrew Bogott) [00:51:01] 6operations, 6Analytics-Engineering: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1764524 (10Milimetric) I just deleted mine, you can safely delete /home/spetrea (he no longer works here). The stuff in ellery's folder looks hard to delete but I'll try to get him to store the... [00:52:07] mutante: lol, fatrace finds that volume too large >_> [00:52:24] YuviPanda: ^ i deleted spetrea's home because he is not with wmf anymore [00:52:34] per comment from milimetric [00:52:38] mutante: cool [00:52:47] it should recover now [00:52:49] mutante: me and leila have emailed ellery [00:52:50] RECOVERY - Disk space on stat1002 is OK: DISK OK [00:52:55] thanks, great [00:53:01] YuviPanda: me too! 
oops [00:53:03] poor ellery [00:53:16] milimetric: hehe [00:53:49] milimetric: unrelated did you see https://phabricator.wikimedia.org/T112321 [00:53:56] 6operations, 6Analytics-Engineering: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1764525 (10Dzahn) @milimetric thank you! i deleted spetrea's data. Yuvi and Leila have mailed ellery. 17:55 < icinga-wm> RECOVERY - Disk space on stat1002 is OK: DISK OK we have 10% free agai... [00:55:50] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:56:08] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1764529 (10GWicke) [00:56:10] 6operations, 10RESTBase, 6Services, 5Patch-For-Review: restbase endpoint reporting incorrect content-encoding: gzip - https://phabricator.wikimedia.org/T116911#1764527 (10GWicke) 5Open>3Resolved The fix was deployed this morning. [00:57:49] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 918790 bytes in 7.363 second response time [00:58:14] YuviPanda: yeah, we talked a bit about productizing today [00:58:41] ellery was saying he might just leverage google search for now, and worry about a better approach later [00:58:57] I was saying y'all should look into elastic search, it might be the best tool for this job [00:59:09] yeah I heard a little bit of that too but wasn't really sure where that came in [01:01:42] (03PS1) 10Dzahn: base: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249666 [01:06:46] !log lowered gc_grace on wikipedia parsoid html and data-parsoid keyspaces to 24 hours [01:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:19:51] (03Abandoned) 10Dzahn: lint: double quoted strings pt.4 [puppet] - 10https://gerrit.wikimedia.org/r/243858 (owner: 10Dzahn) [01:19:57] (03Abandoned) 10Dzahn: lint: double quoted strings pt.3 [puppet] - 
10https://gerrit.wikimedia.org/r/243855 (owner: 10Dzahn) [01:20:24] (03Abandoned) 10Dzahn: lint: double quoted strings pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/243853 (owner: 10Dzahn) [01:20:27] (03PS1) 10Dzahn: diamond,zotero,reprepro: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249670 [01:22:23] (03PS2) 10Dzahn: diamond,zotero,reprepro: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249670 [01:22:49] (03CR) 10Dzahn: [C: 032] diamond,zotero,reprepro: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249670 (owner: 10Dzahn) [01:32:18] (03PS1) 10Dzahn: contint: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249672 [01:33:17] (03PS2) 10Dzahn: contint: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249672 [01:33:24] (03CR) 10Dzahn: [C: 032] contint: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249672 (owner: 10Dzahn) [01:41:10] (03PS1) 10Aaron Schulz: [WIP] Re-enabled sidebar cache per 47eb083a0fe4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249674 [01:42:37] (03PS1) 10Dzahn: gridengine: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249675 [01:45:03] AaronSchulz: the job queue blew up again: https://grafana-admin.wikimedia.org/dashboard/db/job-queue-health [01:45:07] worse than before [01:45:08] 3m jobs [01:47:10] (03PS1) 10John F. Lewis: partman: remove unused mailserver recipe [puppet] - 10https://gerrit.wikimedia.org/r/249677 [01:47:19] (03PS2) 10John F. 
Lewis: partman: remove unused mailserver recipe [puppet] - 10https://gerrit.wikimedia.org/r/249677 [01:47:38] (03PS1) 10Dzahn: labstore: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249678 [01:52:03] ori: looks to be almost entirely wikibase-addUsagesForPage [01:52:31] 1mil in enwiki, 1.3mil in wikidatawiki [01:54:10] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail [01:54:13] i have to go home, it's 7pm and i'm still here [01:54:20] ebernhardson: could you update the ticket with that information? [01:54:34] and feel free to revert any recent wikidata patches [01:54:42] because this is bullshit [01:54:59] if you can spot the culprit patch, I mean. [01:56:20] not sure which, but i'll poke a bit [01:58:18] wikidatawiki has cirrusSearchLinksUpdatePrioritized: 1023289 queued; 7 claimed (7 active, 0 abandoned); 0 delayed [01:58:43] hmm, indeed it does. i was looking at the monitor of what redis is working on at m [01:58:51] checking [01:59:57] and its climbing at about 50/s ... [02:03:25] AaronSchulz: can you kill 27315 on terbium by aude? [02:03:33] or i guess, you probably don't have root either [02:03:58] hello [02:03:59] anyone with root available? aude is running a re-index of cirrussearch on wikidata which i approved, but it looks to be the culprit here since its re-parsing [02:04:12] I guess I could kill it [02:04:15] oh [02:04:18] ebernhardson: AaronSchulz has root I think [02:04:20] * AaronSchulz was taking a dump of hhvm on mw1016 [02:04:35] I can help too if need be [02:04:37] the point of it was to get all the geo data in wikidata indexed into elasticsearch for wikidata [02:04:52] when i approved it i thought it was just going to parse the things, not trigger these other backend jobs [02:04:52] shall I kill it, AaronSchulz? 
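The queue inspection quoted above reads lines like "cirrusSearchLinksUpdatePrioritized: 1023289 queued; 7 claimed (7 active, 0 abandoned); 0 delayed". A minimal sketch of parsing such a line into counters, e.g. to track growth over time; the regex and field names are my own guess at the line's layout, not anything from MediaWiki itself.

```python
import re

# Example status line quoted in the channel.
LINE = ("cirrusSearchLinksUpdatePrioritized: 1023289 queued; "
        "7 claimed (7 active, 0 abandoned); 0 delayed")

# Hypothetical parser: the field layout follows the example line above.
PATTERN = re.compile(
    r"(?P<job>[\w-]+): (?P<queued>\d+) queued; "
    r"(?P<claimed>\d+) claimed \((?P<active>\d+) active, "
    r"(?P<abandoned>\d+) abandoned\); (?P<delayed>\d+) delayed"
)

def parse_group_line(line):
    """Return a dict of job-queue counters parsed from one status line."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError("unrecognized status line: %r" % line)
    d = m.groupdict()
    return {"job": d.pop("job"), **{k: int(v) for k, v in d.items()}}

stats = parse_group_line(LINE)
```

Sampling this a minute apart and differencing `queued` would give the "climbing at about 50/s" figure mentioned just after.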
[02:04:59] the other option is let it be, it will finish eventually [02:05:07] but its doing every page in wikidata [02:05:10] !log Restarted stuck hhvm on mw1016; dump at /tmp/hhvm.25097.bt [02:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:05:35] ebernhardson: is it actually a problem or will it just make a lot of jobs for a while? [02:05:43] AaronSchulz: it will just make a lot of jobs for a while [02:05:52] any idea how far along it is? [02:06:04] not sure [02:07:06] her screen session says, is it possible for root to enter another user's screen session? [02:07:45] yeah [02:08:12] that script usually prints out the % complete every couple % [02:08:19] 19864969 [02:08:30] doesn't have a ^ [02:08:33] % [02:08:39] "wikimedia-privacypage": "{{notranslate}} Default Wikimedia projects link to privacy policy. Should not be changed at the request of Wikimedia Foundation's Legal Department.", [02:08:40] heh [02:08:41] oh i guess since she's using the --fromID parameter [02:08:56] well, it started at 2.7M, it's now at 19.8M, and it will finish at 23M [02:09:06] so its most of the way there already [02:09:23] [ wikidatawiki] Queued 10 pages ending at 19870014 at 113/second [02:09:30] the 113/s is constant [02:11:08] i suppose if its got this far and redis isn't going to run out of memory, we can just leave it? [02:13:00] that also only explains ~1M of the 3M jobs in the queue [02:14:00] looks like ~1M is normal [02:15:10] 5.8 out of ~54G avail on rdb1001 [02:15:17] so not running out anytime soon [02:16:14] similar figures for rdb1003 as I'd expect [02:17:17] YuviPanda: btw I think the vmovercommit thing needs a restart [02:17:30] aaron@rdb1001:~$ uptime [02:17:30] 02:16:58 up 958 days, 7:08, 1 user, load average: 1.26, 0.79, 0.72 [02:17:32] AaronSchulz: on redis? yeah [02:17:42] AaronSchulz: wait I think it just needs a restart of redis no? [02:17:44] not of the machine [02:17:49] did we merge it? 
we haven't I think [02:18:08] * YuviPanda gets out of screen carefully [02:18:25] would be nicer to restart after https://phabricator.wikimedia.org/T89400 heh [02:19:06] 17602 redis 20 0 20.1g 7.3g 1188 R 65 10.3 60315:43 redis-server [02:19:20] at 65% cpu all the time means restarting one would put the other at saturation [02:19:39] and if the restart goes bad on the off chance, then it's kind of fucked [02:19:58] yeah [02:20:08] that's one of the reasons I haven't really touched it [02:20:10] I guess you'd have to turn off jobchron first or something [02:20:12] to lower the load [02:20:21] redis also takes forever [02:20:23] to restart [02:20:30] feeding in the aof, right [02:21:40] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [02:23:52] YuviPanda: you want to work on https://phabricator.wikimedia.org/T89400 btw ? [02:24:18] * AaronSchulz could also pester robh ;) [02:24:34] AaronSchulz: yeah, robh seems like the right choice :D [02:24:46] AaronSchulz: probably needs an approval from mark too [02:25:21] 6operations: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1764616 (10yuvipanda) So this needs two new machines to be allocated, right? [02:25:36] AaronSchulz: I suppose we have to buy those machines? [02:26:41] 6operations: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1764618 (10aaron) [02:26:56] 6operations: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1035333 (10aaron) >>! In T89400#1764616, @yuvipanda wrote: > So this needs two new machines to be allocated, right? Yes. [02:27:24] I don't think there are similar spares [02:28:06] YuviPanda: though MW does support weights, so using a somewhat different spare (but still ssd, moderately high ram) can be an option if needed [02:28:24] yeah but that'll just confuse things... 
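The memory check above ("5.8 out of ~54G avail on rdb1001") is the kind of figure `redis-cli info memory` reports as `used_memory` lines. A sketch of pulling those fields out of INFO-style text and computing headroom; the sample text and the capacity value are stand-ins chosen to echo the numbers in the log, not real server output.

```python
# Sample of the "key:value" lines that `redis-cli info memory` emits;
# the numbers are stand-ins echoing the 5.8G used / ~54G available above.
INFO_MEMORY = """\
used_memory:6227702579
used_memory_human:5.80G
maxmemory:57982058496
"""

def parse_info(text):
    """Parse Redis INFO-style output into a dict, keeping values as strings."""
    out = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            out[key] = value
    return out

info = parse_info(INFO_MEMORY)
used = int(info["used_memory"])
cap = int(info["maxmemory"])
headroom_fraction = 1 - used / cap  # ~0.89 with these figures
```

With roughly 89% headroom, the "not running out anytime soon" call above checks out even with the reindex queueing jobs continuously.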
[02:28:28] let's just do https://wikitech.wikimedia.org/wiki/Operations_requests#Hardware_Requests [02:28:49] AaronSchulz: can you fill that out? [02:29:51] YuviPanda: well it's basically "a copy of the rdb* hardware" [02:30:21] hmm [02:35:30] PROBLEM - HHVM rendering on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 9.423 second response time [02:36:00] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:37:10] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 64346 bytes in 0.221 second response time [02:37:10] PROBLEM - pybal on lvs1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [02:37:30] PROBLEM - pybal on lvs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [02:37:40] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.174 second response time [02:37:49] PROBLEM - configured eth on lvs1008 is CRITICAL: eth3 reporting no carrier. [02:37:49] PROBLEM - pybal on lvs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [02:37:49] PROBLEM - configured eth on lvs1011 is CRITICAL: eth3 reporting no carrier. [02:38:01] PROBLEM - configured eth on lvs1010 is CRITICAL: eth3 reporting no carrier. 
[02:38:30] PROBLEM - pybal on lvs1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [02:38:30] PROBLEM - pybal on lvs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [02:38:51] PROBLEM - pybal on lvs1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [02:39:45] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 10m 47s) [02:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:58] um [02:42:10] I should probably look at these [02:42:24] are those lvs machines ok? [02:42:24] the mw* stuff has recovered but lvs hasn't [02:42:29] looking [02:44:12] ok pybal is dead [02:46:28] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-29 02:46:27+00:00 [02:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:47:09] pybal is dead and nothing in dmesg [02:47:27] eth3 is up too [02:47:38] Krenair: do you know where the 5xx graphs are? [02:47:49] um [02:47:51] Krinkle might [02:48:17] somewhere in gdash/graphite/ganglia/something else? [02:48:27] https://grafana-admin.wikimedia.org/dashboard/db/varnish-http-errors [02:48:46] I was just playing with them in fact [02:48:52] Added timeshifts for context [02:48:56] or grafana [02:49:00] I knew it was one of the 'g's [02:49:07] but couldn't remember the last one [02:49:18] gdash is gone, replaced with static mirror until migration is ready [02:49:31] graphite is data store, grafana frontend (simplified perception, but for practical purposes) [02:49:45] ok I don't see a 5xx increase [02:52:13] 7 to 12 have the same puppet define [02:52:37] 'row D subnets' [02:52:44] 'row D subnets on eth 3' [02:52:54] so maybe row D is dead [02:53:42] they're all in the same row and reporting the same network interface error? 
[02:53:59] they're all reporting error on one particular interface [02:54:02] which is connected to one row [02:55:21] did those alerts page ops? [02:55:25] no [02:55:34] it's 'CRITICAL' that marks ops being paged isn't it? [02:55:45] critical => true [02:55:51] which is different from nagios critical [02:56:07] pybal logging is useless [02:56:46] I'm attempting to find which hosts are on row D [02:56:51] and if I can't reach them I'll start paging [02:56:56] although this is like literally the worst time [02:57:02] east coast is asleep and europe hasn't woken up yet [02:58:27] there's a sharp spike in http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false [02:58:51] corresponding roughly to the alerts [02:59:37] !log start pybal on lvs1007 [02:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:59:48] but it came back down, whereas the alerts are presumably still active [03:00:10] RECOVERY - pybal on lvs1007 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [03:00:27] Krenair: yeah probably because pybal doesn't start itself back up [03:00:47] I started it back on lvs1007 and it's healthy (I see tcp go through) [03:00:59] but the eth3 alerts? [03:01:10] I'm not sure about those [03:01:21] even then that just removes one row out of the equation we still serve from the other 2 [03:02:12] according to the icinga web interface it's critical, eth3 reporting no carrier. [03:02:25] but duration 1 day [03:02:34] (lvs1008) [03:02:44] !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 06m 03s) [03:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:02:55] lol? [03:03:06] Krenair: so maybe it just died yesterday this time [03:03:10] and it was the daily re-alert? 
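The pybal alerts above come from a process-count check ("PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal"). A toy re-implementation of that logic over a ps-style process table, to show why a dead daemon flips the check from OK to CRITICAL; the table and functions are illustrative, not the actual check_procs plugin.

```python
# A ps-style snapshot: (uid, command line). Entries are made up for illustration.
PROCESSES = [
    (0,  "/usr/sbin/sshd -D"),
    (0,  "/usr/sbin/pybal"),
    (33, "/usr/sbin/apache2 -k start"),
]

def count_procs(processes, uid, args_prefix):
    """Count processes owned by `uid` whose command line starts with
    `args_prefix`, mirroring what a check_procs-style monitor counts."""
    return sum(1 for puid, cmd in processes
               if puid == uid and cmd.startswith(args_prefix))

def check_pybal(processes):
    """Return an icinga-style status line for the pybal process check."""
    n = count_procs(processes, uid=0, args_prefix="/usr/sbin/pybal")
    state = "OK" if n >= 1 else "CRITICAL"
    return ("PROCS %s: %d processes with UID = 0 (root), "
            "args /usr/sbin/pybal" % (state, n))
```

As the discussion notes, the check only observes the process: pybal does not start itself back up, so the alert stays CRITICAL until someone starts it.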
[03:03:34] specifically 1d 10h 49m 27s since I loaded the page [03:03:42] before I* [03:03:48] ok then that theory isn't true [03:04:32] puppet is also disabled [03:04:59] > Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'under provisioning, ask faidon/bblack'); [03:05:01] heh [03:05:03] ok [03:05:06] I'll file a task anyway [03:05:52] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-10-29 03:05:52+00:00 [03:05:55] wouldn't it be normal to acknowledge those? [03:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:23] yes [03:07:53] 6operations, 10Traffic: Investigate PyBal dead on lvs1007-12 - https://phabricator.wikimedia.org/T117015#1764639 (10yuvipanda) 3NEW [03:08:22] 6operations, 10Traffic: Investigate PyBal dead on lvs1007-12 - https://phabricator.wikimedia.org/T117015#1764647 (10yuvipanda) I started pybal on 1007 and it seems ok... [03:09:14] 6operations, 10Traffic: Investigate PyBal dead on lvs1007-12 - https://phabricator.wikimedia.org/T117015#1764649 (10yuvipanda) Aaah, and puppet is disabled on all of those hosts with: `Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'under provisioning, ask faidon/bblac... 
[03:17:30] Krenair: I hence declare this 'not an outage' [03:17:34] and will keep an eye on 5xx for a while [03:17:41] k [03:19:40] (03PS1) 10Krinkle: navtiming: Track responseStart in Graphite (Time to first byte) [puppet] - 10https://gerrit.wikimedia.org/r/249682 [03:53:00] (03CR) 10Ori.livneh: [C: 032] navtiming: Track responseStart in Graphite (Time to first byte) [puppet] - 10https://gerrit.wikimedia.org/r/249682 (owner: 10Krinkle) [03:55:45] !log cassandra *staging*: testing DateTieredCompactionStrategy (https://labs.spotify.com/2014/12/18/date-tiered-compaction/) on wikipedia html and data-parsoid tables [03:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:03:05] (03PS1) 10Alex Monk: Make mediawiki-config clone be owned by mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/249684 (https://phabricator.wikimedia.org/T117016) [04:12:16] 6operations, 10Traffic: Investigate PyBal dead on lvs1007-12 - https://phabricator.wikimedia.org/T117015#1764688 (10BBlack) 5Open>3Resolved a:3BBlack Sorry, that's my bad. These were all downtimed in icinga, but the downtimes expired which triggered the alerts to show. I've re-downtimed them (for a mon... [04:15:51] ori: did you know that we're responsible for about 40% (each) of the 'fuck's in our commit messages? [04:15:55] TIL! [04:18:18] actually no, I've 3 and you've 2, so I'm actually responsible for 50% [04:18:33] seems like a misuse of percentages :P [04:18:38] yeah :P [04:18:50] there's only a total of 6 of them [04:19:05] 3 Fucks doesn't sound as much as '50% of all fucks in the entire repo' [04:19:17] :) [04:19:36] (I used to be a 'journalist' for my school newspaper, I'm sure you can tell) [04:24:11] cirrus jobs on wikidata are now cleared out (pointed extra runners at it). 
but they will just start growing again now till the index operation finishes [04:25:06] also makes me wonder if job runners would be a good use of VMs / autoscaling (iirc there was some talk of starting to use virtualization in prod) [04:26:16] there was vague talk of using kubernetes with autoscaling for these [04:26:21] super vague and far off tho [04:28:30] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. (read timeout=4) [04:29:05] ^ can still ignore those, nobelium is creating and filling indices still [04:29:40] although, not sure why it would time out ... could be gc pauses or some such [04:32:19] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [04:39:13] hmm, yea old gc went from 5-10s up to 50s in the last 20 minutes, not sure why :( probably not a big deal [05:31:50] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. 
(read timeout=4) [05:35:29] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [05:37:48] !log restarting elasticsearch on nobelium [05:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:38:04] !log restarting elasticsearch on nobelium to attempt to clear up extra log GC pauses in the old generation (50s+) [05:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:38:20] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0] [05:39:30] nobelium will probably complain about bad status now until it reloads all the indices... 
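The nobelium checks in this log fetch `_cluster/health` and report its fields verbatim; one CRITICAL cites "inactive shards 1238 threshold =0.1% breach". A sketch of that classification, treating unassigned plus initializing shards as inactive and comparing their share of all shards against a threshold. This is my guess at what the check computes, using shard counts taken from the alerts in the log; the real check's logic may differ.

```python
# Shard counts copied from the _cluster/health alerts in the log.
RED_HEALTH = {"status": "red", "unassigned_shards": 1235,
              "initializing_shards": 3, "active_shards": 549}
GREEN_HEALTH = {"status": "green", "unassigned_shards": 0,
                "initializing_shards": 0, "active_shards": 1787}

def inactive_fraction(health):
    """Fraction of shards that are not active (unassigned or initializing)."""
    inactive = health["unassigned_shards"] + health["initializing_shards"]
    total = inactive + health["active_shards"]
    return inactive / total

def classify(health, threshold=0.001):
    """Guessed check logic: CRITICAL once inactive shards exceed ~0.1%."""
    return "CRITICAL" if inactive_fraction(health) > threshold else "OK"
```

Right after a single-node restart nearly every shard is unassigned (1238 of 1787 here, about 69%), so the check stays CRITICAL until the node finishes reloading its indices.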
[05:40:49] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch inactive shards 1238 threshold =0.1% breach: status: red, number_of_nodes: 1, unassigned_shards: 1235, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 549, cluster_name: labsearch, relocating_shards: 0, active_shards: 549, initializing_shards: 3, number_of_data_nodes: 1, delayed_unassigne [05:46:18] !log finished manually running 3M enwiki/enqueue, enwiki/wikibase-addUsagesForPage, and wikidatawiki/cirrusSearchLinksUpdatePrioritized jobs from mw1011 and mw1012 [05:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:49:49] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: red, number_of_nodes: 1, unassigned_shards: 24, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1760, cluster_name: labsearch, relocating_shards: 0, active_shards: 1760, initializing_shards: 3, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [05:51:29] ebernhardson: thanks, that was pretty heroic [05:51:41] ori: shrug, just turn it on and wait :) [05:52:12] but it makes me think, the job runner could just do it itself, since it didn't need extra resources and was barely visible in ganglia as extra load [05:53:00] the enqueue and wikibase-addUsagesForPage jobs all took <10ms each [05:57:34] the wikibase ones are tricky though ... they ran so fast they started hitting the slave-lag limits built into the job runner and had to scale it back (to 15 runners instead of 30 as i did for the others) [05:57:44] well, at least tricky to handle right programmatically [06:17:33] Can someone please silence some catchpoint alerts for me? [06:17:51] We know the service is down, and I'd rather not get spammed all night, in case something else important happens. 
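ebernhardson's observation above, that the enqueue and addUsagesForPage jobs took under 10 ms each, is why a runner can fit many jobs into one invocation. A toy sketch of a runner that keeps popping jobs until a wall-clock budget is spent; the queue and the no-op jobs are simulated stand-ins, not MediaWiki's actual RunJobs.php.

```python
import time
from collections import deque

def run_jobs(queue, budget_seconds, clock=time.monotonic):
    """Pop and execute jobs until the queue is empty or the time budget
    is spent. Returns the number of jobs executed."""
    done = 0
    deadline = clock() + budget_seconds
    while queue and clock() < deadline:
        job = queue.popleft()
        job()  # each job is just a callable in this sketch
        done += 1
    return done

# Simulated workload: cheap no-ops standing in for the <10ms queue jobs.
queue = deque([lambda: None for _ in range(1000)])
executed = run_jobs(queue, budget_seconds=1.0)
```

With jobs this cheap, a single invocation drains the whole batch well inside the budget, which is the idea behind running "as many jobs as fit in the time limit" rather than a fixed count per invocation.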
[06:18:34] please PM me so I can send the URLs [06:19:22] (03PS1) 10EBernhardson: Add configuration for jobs per invocation [puppet] - 10https://gerrit.wikimedia.org/r/249686 [06:19:24] alternatively: can anyone tell me who owns the catchpoint stuff? [06:25:10] (03Abandoned) 10EBernhardson: Add configuration for jobs per invocation [puppet] - 10https://gerrit.wikimedia.org/r/249686 (owner: 10EBernhardson) [06:27:28] (03PS1) 10Aaron Schulz: Fix broken boilerplate maxjobs default in RunJobs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249687 [06:28:27] (03CR) 10Aaron Schulz: [C: 032] Fix broken boilerplate maxjobs default in RunJobs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249687 (owner: 10Aaron Schulz) [06:28:49] (03Merged) 10jenkins-bot: Fix broken boilerplate maxjobs default in RunJobs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249687 (owner: 10Aaron Schulz) [06:28:59] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:29:30] chasemp: Hi, sorry to pull the emergency lever here... 
My question is a page up in backscroll ^ [06:30:10] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:15] !log aaron@tin Synchronized rpc/RunJobs.php: 29ccbd248 (duration: 00m 17s) [06:30:20] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:29] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:30] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:33] We're doing an overnight schema migration, and the hourly catchpoint spam is unnerving [06:30:50] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:00] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:08] (03PS1) 10EBernhardson: [rpc] Run as many jobs as fit in the time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249688 [06:31:09] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:25] (03Abandoned) 10EBernhardson: [rpc] Run as many jobs as fit in the time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249688 (owner: 10EBernhardson) [06:31:40] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:50] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:01] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:01] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:10] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:51] AaronSchulz: the runners have now immediately hit the 
slave-lag-limit :) [06:33:02] (not sure what to do about that, or just leave it be) [06:33:10] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:18] ebernhardson: are you still running jobs? [06:33:29] AaronSchulz: commonswiki ones that only talk to ES [06:34:01] its not a big deal, was just wondering why after clearing out the big lists there were still a million jobs, and it turned out 250k of them were more cirrus jobs hanging around [06:34:45] they are growing rather than declining, which is the only thing that made me wonder [06:36:01] awight: still in need of silencing help? [06:36:12] I can probably dig it up and do it if you want [06:36:14] there are also 900k refreshLinksDynamic in commonswiki, thats the bulk of whats filling the job queue right now. But i don't know what they do so not touching them :) [06:36:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-5/2/2: down - Core: cr1-ulsfo:xe-0/0/3 GTT/TiNet (02773-004-32) [2Gbps MPLS]BR [06:37:00] um [06:37:01] YuviPanda: that would be great. PMming [06:37:07] awight: kk cool. [06:37:19] paravoid: around? I guess this is the MPLS circuit? [06:37:26] and it's ok for it to be down? [06:40:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 1, unused: 0 [06:40:55] YuviPanda: <3 just saw your email "Anyone finding the catch point emails to ops@ useful" [06:40:58] seriously. [06:42:43] awight: you should respond :D [06:43:10] !log disabled two fr-tech related catchpoint tests per awight's request [06:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:43:42] YuviPanda: Thanks, that was a huge favor! [06:43:54] awight: <3 yw. [06:44:02] awight: remember to turn it back on though. [06:44:25] awight: how frequent was the spam? 
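Scaling back from 30 runners to 15 once the slave-lag limit kicks in, as described above, is a simple feedback rule. A toy sketch of such a throttle: halve concurrency while measured replication lag is over the limit, and creep back up once it recovers. The numbers and policy here are illustrative, not the job runner's real algorithm.

```python
def next_concurrency(current, lag_seconds, max_lag=5.0, floor=1, ceiling=30):
    """One step of a lag-feedback throttle: halve the runner count while
    replication lag is over the limit, otherwise add one back (capped)."""
    if lag_seconds > max_lag:
        return max(floor, current // 2)
    return min(ceiling, current + 1)

# Walk the controller through a simulated lag spike and recovery.
runners, history = 30, []
for lag in [1.0, 8.0, 9.0, 2.0, 1.0]:
    runners = next_concurrency(runners, lag)
    history.append(runners)
```

Backing off multiplicatively but recovering additively keeps the runners from oscillating straight back into the lag limit they just tripped.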
[06:44:33] I want to make sure that actually worked before I hit the bed [06:44:37] and Catchpoint has shitty UI [06:45:35] It's hourly, don't even sweat it if the config didn't stick [06:45:49] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:55] awight: ok! [06:46:06] crikey, no it was every 20 minutes [06:46:09] haha [06:46:11] lol [06:46:21] yeah I think it might've been 15 or 20mins [06:46:28] gah and the other was 10 minutes [06:46:31] hehe [06:46:33] dastardly little thing [06:46:34] easy to verify then [06:46:39] I'll go brush and be back [06:46:50] see ya! [06:48:41] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [06:53:37] awight: all good? [06:54:11] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:56:12] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:04:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 29 07:04:04 UTC 2015 (duration 4m 3s) [07:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:08:06] YuviPanda: /me falls into a quiet reverie [07:08:17] awight: awesome [07:08:20] Haven't heard a peep since you put the 1,000 kg weight on it! [07:08:26] I shall too fall into a quiet reverie [07:08:28] Thanks again! [07:08:35] somehow that word reminds me of the SF - Chicago train trip [07:08:40] awight: np. good night! [07:08:50] gah. I really enjoyed that track, too [07:09:14] Ran out of homemade food by that point though, which soured the ride so to speak [07:09:37] yeah the train food is shit [07:09:47] awight: but in Denver you can run out and get pizza and run back [07:09:50] so that was ok [07:09:54] and then we had lots of junk food [07:10:02] me too! and a 24 pack of crappy beer [07:10:16] awight: hah! 
I wasn't sure if drinking was legal in the train so didn't bring any [07:10:18] mistake [07:10:21] ;) [07:10:22] but other fun things made up for it [07:10:34] I shall go to sleep thinking of the salt plains now! [07:10:43] my seatmate was killing this enormous backpack of incredibly strong THC brownies [07:10:51] I had no idea for like 3 days. [07:11:07] hahaha [07:11:12] niceeeee [07:11:15] rando northern CA gangster [07:11:19] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:11:22] sounds like [07:11:30] I hope that person didn't run out of food [07:11:57] I think I figured it out when she started telling long stories about DEA busts and repeatedly muttering that the old guy with the tight T-shirt that read, "ARMY" was actually in the CIA [07:12:54] awight: hehe, did she come by to the observation car when some sort of veterans group shows up to be a voice over? [07:13:10] Anyway, great way to travel. I wrote the only letters of my life the whole time, and kept running out in small towns to find a mailbox [07:13:18] a mutter-over [07:14:12] nice [07:14:15] I should do that again [07:14:19] once my travel ban is over [07:14:23] wat [07:14:31] it's a self imposed travel ban [07:14:40] my last europe trip was supposed to be 6 days [07:14:41] aah. [07:14:44] ended up being about 31 [07:14:52] so tempting to just add another week y'know :D [07:14:54] I was just about to say, try a boxcar next time [07:15:04] so self imposed travel ban for 6 months so I can experience living in one place [07:15:08] what's a boxcar? [07:15:08] I guess I should avoid setting you off, though [07:15:16] aah [07:15:25] awight: nah I'm pretty solidly nailed down [07:15:39] awight: no train hopping/surfing for me! 
I hear that can get pretty brutal [07:15:44] have friends of friends who used to do that [07:16:20] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1764797 (10ArielGlenn) I don't want the salt master to be unavailable if we have problems with ganeti; I also want to guarantee access to a minimum amount of cpu resources, which does not ha... [07:16:30] * awight pines for the days of standby flights [07:16:35] ebernhardson: ori still around? [07:16:49] awight: heh :) [07:17:06] awight: apparently the 'american way' is to buy a car but that's terrifying. plus can't enjoy anything if you're actually driving [07:18:00] Driving sucks. The only advantage is that you can sleep in state parks rather than a barely inclined foam coffin [07:18:21] I totally slept ok on the seats in the train [07:18:24] much better than airline seats [07:18:31] I can still sleep ok on airline seats too [07:18:45] but yeah, national parks sounds fun [07:18:52] man of mettle :p [07:19:08] awight: nah, 4 years of sleeping in class... [07:19:19] aude: your thing is just fine, it turned out to be a problem with the job queue since it was switched over to the rpc mechanism [07:19:35] since we figured that out, the job queue is now dropping nicely to sane levels :) [07:20:25] i see [07:20:30] awight: we should both go to sleep :) [07:20:35] * YuviPanda goes [07:20:45] that's what i thought, since my thing has been running since day before yesterday [07:21:04] and then i saw a big peak a few hours ago [07:21:21] at timing unrelated [07:21:38] the reindexing should be done shortly :) [07:21:43] sweet!
[07:22:04] * aude heads to the office [07:25:29] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:25:30] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:26:21] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:26:29] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [07:26:31] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [07:26:39] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:26:40] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:27:20] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:30] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:31] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:00] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:09] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:10] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:29] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:31:54] 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1764810 (10Deskana) I was 
previously asked to approve access requests from WMDE moving forwards. So, approved. [07:47:12] (03PS1) 10Giuseppe Lavagetto: maintenance: include scap scripts in the role, they are needed. [puppet] - 10https://gerrit.wikimedia.org/r/249693 [07:47:53] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249693 (owner: 10Giuseppe Lavagetto) [07:52:40] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [07:54:33] (03PS10) 10ArielGlenn: Generate weekly cirrussearch dumps [puppet] - 10https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: 10EBernhardson) [07:55:41] (03CR) 10ArielGlenn: [C: 032] Generate weekly cirrussearch dumps [puppet] - 10https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: 10EBernhardson) [08:01:20] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: puppet fail [08:02:54] yeah, ignore that, I'm working on it [08:04:37] (03PS1) 10Giuseppe Lavagetto: maintenance: actually ensure updatequerypages crons [puppet] - 10https://gerrit.wikimedia.org/r/249694 [08:10:11] (03PS1) 10ArielGlenn: fix typo in snapshots role class [puppet] - 10https://gerrit.wikimedia.org/r/249695 [08:11:17] (03CR) 10ArielGlenn: [C: 032] fix typo in snapshots role class [puppet] - 10https://gerrit.wikimedia.org/r/249695 (owner: 10ArielGlenn) [08:13:14] (03PS2) 10Giuseppe Lavagetto: maintenance: actually ensure updatequerypages crons [puppet] - 10https://gerrit.wikimedia.org/r/249694 [08:14:00] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:21:43] (03PS3) 10Giuseppe Lavagetto: maintenance: actually ensure updatequerypages crons [puppet] - 10https://gerrit.wikimedia.org/r/249694 [08:25:36] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: actually ensure updatequerypages crons [puppet] - 
10https://gerrit.wikimedia.org/r/249694 (owner: 10Giuseppe Lavagetto) [08:39:56] (03CR) 10Muehlenhoff: base: lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249666 (owner: 10Dzahn) [08:41:22] (03PS2) 10Jcrespo: Enable performance_schema on db1065 (and user_stats = 0) [puppet] - 10https://gerrit.wikimedia.org/r/249480 [08:47:21] (03CR) 10Jcrespo: [C: 032] Enable performance_schema on db1065 (and user_stats = 0) [puppet] - 10https://gerrit.wikimedia.org/r/249480 (owner: 10Jcrespo) [08:52:50] (03PS1) 10Jcrespo: Depool db1065 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249696 [08:52:51] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures [08:54:58] (03PS2) 10Muehlenhoff: Include base::firewall in the mariadb::labsdb role [puppet] - 10https://gerrit.wikimedia.org/r/245958 [08:56:04] (03CR) 10Jcrespo: [C: 031] Include base::firewall in the mariadb::labsdb role [puppet] - 10https://gerrit.wikimedia.org/r/245958 (owner: 10Muehlenhoff) [08:58:17] (03CR) 10Jcrespo: [C: 032] Depool db1065 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249696 (owner: 10Jcrespo) [08:59:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Include base::firewall in the mariadb::labsdb role [puppet] - 10https://gerrit.wikimedia.org/r/245958 (owner: 10Muehlenhoff) [08:59:44] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 for maintenance (duration: 00m 17s) [08:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:02:06] (03PS2) 10Mobrovac: RESTBase: Strip redundant headers from back-end services [puppet] - 10https://gerrit.wikimedia.org/r/249465 (https://phabricator.wikimedia.org/T116911) [09:04:23] I've depooled 1 out of 2 enwiki API servers, ping me if you see something strange on API activity [09:04:44] (edit api, not rest api) [09:07:26] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - 
https://phabricator.wikimedia.org/T116739#1764873 (10mobrovac) [09:07:28] 6operations, 10RESTBase, 6Services, 5Patch-For-Review: restbase endpoint reporting incorrect content-encoding: gzip - https://phabricator.wikimedia.org/T116911#1764871 (10mobrovac) 5Resolved>3Open Reopening, as this needs the config change to be deployed as well. [09:13:01] !log restarting db1065 for regular maintenance [09:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:18:46] (03CR) 10Muehlenhoff: "The idea here is to provide test instances of external distros for CI tests, right?" [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [09:20:00] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:21:06] <_joe_> Major power failure at my home. Ping me here if I am needed [09:22:33] we have a problem here, not creating great issues, but scary for the future [09:22:49] some mediawikis have not been synced, despite tin confirming so [09:23:22] tin said "sync-common: 100% (ok: 467; fail: 0; left: 0)" [09:24:24] but mw1083 has the old config and it is sending queries to a db that is not pooled [09:25:10] (03PS1) 10Alexandros Kosiaris: Revert "maps: Add tileratorui service" [puppet] - 10https://gerrit.wikimedia.org/r/249697 (https://phabricator.wikimedia.org/T116062) [09:26:03] 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1764875 (10hashar) Thanks for all the attention. The pbuilder hook defines `jessie-wikimedia/backports` but the packages are in upstream `jessie-
[09:27:41] I see T116184 [09:29:27] and confirming it is not "real" traffic, so not an issue [09:31:39] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5000000.0] [09:32:26] <_joe_> jynus: did you verify that mw1083 is in the dsh group for scap? [09:32:34] it is out [09:32:46] <_joe_> sorry disconnecting/reconnecting [09:32:56] <_joe_> jynus: did we get an alert for it? we should [09:33:08] * aude vaguely recalls issues with mw1083 recently [09:33:11] and it was depooled [09:33:14] yes [09:33:27] <_joe_> yes it has the disk in read-only mode I guess [09:33:29] it is on the ticket I found [09:33:30] would be good to find out why and fix [09:33:34] yep [09:33:37] <_joe_> so we might just want to turn it off [09:34:02] I worried because of the logs [09:34:12] but they do not come from real traffic [09:34:36] I can switch it off later, let me first boot a real server [09:34:44] which is failing to do so [09:34:50] and it is way more important [09:35:20] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [5000000.0] [09:39:23] (03CR) 10Hoo man: [C: 031] admin: hoo and jzerebecki for wdqs admins [puppet] - 10https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) (owner: 10Dzahn) [09:40:25] (03CR) 10Hashar: "A note: I have no idea what debdeploy is." [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [09:40:41] (03PS1) 10Aude: Display labels in mobile search on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249699 [09:40:56] moritzm: Guten Tag. I replied to CI unattended upgrade stuff at https://gerrit.wikimedia.org/r/243925 :) [09:44:37] sure, as mentioned it makes sense, simply wanted to clarify daniel's concern/remark. 
I'll let him review the actual patch, currently busy with other things [09:44:38] (03PS1) 10Giuseppe Lavagetto: maintenance: move jobqueue_status to mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/249700 (https://phabricator.wikimedia.org/T116728) [09:44:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [09:44:40] (03PS1) 10Giuseppe Lavagetto: maintenance: run purge_securepoll once, not every minute for one hour [puppet] - 10https://gerrit.wikimedia.org/r/249701 [09:44:42] (03PS1) 10Giuseppe Lavagetto: maintenance: move purge_securepoll off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249702 (https://phabricator.wikimedia.org/T116728) [09:44:44] (03PS1) 10Giuseppe Lavagetto: maintenance: move purge_checkuser off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249703 (https://phabricator.wikimedia.org/T116728) [09:44:46] (03PS1) 10Giuseppe Lavagetto: maintenance: make purge_abusefilter run daily, not minutely for 1 hour [puppet] - 10https://gerrit.wikimedia.org/r/249704 [09:44:48] (03PS1) 10Giuseppe Lavagetto: maintenance: move purge_abusefilter off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249705 (https://phabricator.wikimedia.org/T116728) [09:44:50] (03PS1) 10Giuseppe Lavagetto: maintenance: move update special pages off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249706 (https://phabricator.wikimedia.org/T116728) [09:44:52] (03PS1) 10Giuseppe Lavagetto: maintenance: move tor jobs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249707 [09:44:54] (03PS1) 10Giuseppe Lavagetto: maintenance: move upload stash off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249708 (https://phabricator.wikimedia.org/T116728) [09:44:56] (03PS1) 10Giuseppe Lavagetto: maintenance: move parser cache purging off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249709 (https://phabricator.wikimedia.org/T116728) [09:45:07] ok, it came 
back, now I can check mw1083 [09:47:42] it is pure hardware error [09:48:02] ATA errors [09:49:39] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1083's sda disk is dying - https://phabricator.wikimedia.org/T116184#1764902 (10jcrespo) I will shutdown this machine now so it does not query the mysql servers with an outdated configuration. [09:50:10] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [09:51:04] !log shutdown mw1083 to avoid querying the mysql servers with an outdated config/spamming the error logs [09:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:48] ^aude, it just needs a disk replacement, and has already been reported [09:52:02] jynus: ok [09:52:05] (03PS2) 10Aude: Fetch labels in mobile api queries on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249699 [09:52:12] * aude needs to deploy a config change [09:52:17] chris is very fast doing that [09:52:22] greg knows about it and it's ok [09:52:41] PROBLEM - Host mw1083 is DOWN: PING CRITICAL - Packet loss = 100% [09:52:48] oops [09:53:21] (03CR) 10Hashar: "I have filed the puppet failure on deployment-fluorine as T117028" [puppet] - 10https://gerrit.wikimedia.org/r/240334 (owner: 10ArielGlenn) [09:54:34] aude: don't quote me, but I think the gentlemen's agreement is that WMDE people know to be cautious when deploying and are more or less part of #releng deployers :-} [09:54:38] (03CR) 10Filippo Giunchedi: [C: 04-1] cache: vary statsd_server with hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [09:54:58] hashar: yep [09:55:12] aude: still good to warn ops about the deploy anyway :-} [09:55:15] i just didn't put it on the calendar, though it will be quick
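jynus's diagnosis of mw1083 above ("pure hardware error", "ATA errors") typically comes from scanning the kernel log. A hedged sketch of that triage step; the regex is a rough heuristic and the sample dmesg-style lines are invented, and real triage would also check SMART data (e.g. `smartctl -a /dev/sda`):

```python
import re

# Heuristic for kernel ATA error lines, e.g. "ata1.00: exception Emask ...".
ATA_ERROR = re.compile(r"ata\d+(\.\d+)?: (exception|failed command|SError)",
                       re.IGNORECASE)

def failing_ata_lines(dmesg_lines):
    """Return kernel log lines that look like ATA/disk errors."""
    return [line for line in dmesg_lines if ATA_ERROR.search(line)]

# Invented sample lines for illustration.
sample = [
    "[1234.5] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6",
    "[1234.6] ata1.00: failed command: READ DMA",
    "[1300.0] eth0: link up",
]
```

On a dying disk like mw1083's sda, such lines usually precede the filesystem remounting read-only, which matches the "disk in read-only mode" guess earlier in the log.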
10https://gerrit.wikimedia.org/r/249699 (owner: 10Aude) [09:57:11] (03CR) 10Aude: Fetch labels in mobile api queries on wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249699 (owner: 10Aude) [09:57:42] (03CR) 10Aude: [C: 032] "this stuff works good on test.wikidata, and should be good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249699 (owner: 10Aude) [09:57:48] (03Merged) 10jenkins-bot: Fetch labels in mobile api queries on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249699 (owner: 10Aude) [09:58:48] (03CR) 10Alexandros Kosiaris: [C: 031] "Arghh!!! It was like that? sigh" [puppet] - 10https://gerrit.wikimedia.org/r/249701 (owner: 10Giuseppe Lavagetto) [09:59:04] (03CR) 10Alexandros Kosiaris: "Arghh!!! It was like that? sigh" [puppet] - 10https://gerrit.wikimedia.org/r/249701 (owner: 10Giuseppe Lavagetto) [10:00:10] !log aude@tin Synchronized wmf-config/Wikibase.php: Config for fetching labels on mobile wikidata (duration: 00m 18s) [10:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:00:59] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable nearby on wikidata (duration: 00m 18s) [10:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:01:30] * aude verifies [10:01:38] (03CR) 10Hashar: "I noticed that `statsd` and thought we might want to be able to set different statsd server. Additionally I was to lazy to verify whether " [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [10:01:40] * hoo tests [10:01:55] Looks good to me [10:02:20] Also there's weird stuff nearby... am in a train in the middle of nowhere. 
[10:03:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:04:37] !log removed openjdk-7 from cassandra test hosts (now using openjdk-8) [10:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:18] hoo: heh :) [10:08:14] (03PS2) 10Hashar: cache: vary statsd_server with hiera [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) [10:09:34] (03CR) 10Hashar: "Switched to use hiera('statsd')." [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [10:10:52] aude: do you have anything else to deploy ? [10:11:03] aude: we would like to restart Jenkins [10:11:09] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:13:06] hashar: We are fine for now, I think... will probably need to push out more stuff [10:13:08] but not right now [10:13:10] mobrovac: bonjour. Looking at logstash I noticed a spam of EMERGENCY level errors with: Error: ENOENT, open '/srv/deployment/restbase/deploy/restbase/lib/../specs/analytics/v1/pageviews.yaml' [10:13:34] hoo: aude ok restart Jenkins :) [10:13:42] hashar: prod logstash? [10:14:03] !log Upgrading java on gallium and restarting Jenkins [10:14:04] mobrovac: yeah [10:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:14:20] mobrovac: head to the default dashboard and filter on EMERGENCY [10:14:36] kk thnx hashar [10:14:45] (03PS3) 10Filippo Giunchedi: restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) [10:15:19] ah those are the test hosts hashar [10:15:28] still, why are they missing that file? 
[10:15:32] (03PS1) 10Jcrespo: Repool db1065 after maintenance (with lower weight than normal) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249713 [10:16:02] (03PS2) 10Jcrespo: Repool db1065 after maintenance (with lower weight than normal) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249713 [10:16:39] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:16:56] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move jobqueue_status to mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/249700 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [10:19:17] (03CR) 10Jcrespo: [C: 032] Repool db1065 after maintenance (with lower weight than normal) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249713 (owner: 10Jcrespo) [10:19:25] mobrovac: filled it at https://phabricator.wikimedia.org/T117029 [10:19:41] kk i think i just realised why [10:19:48] thnx for the heads-up hashar! [10:20:03] I am not sure why we have a test host in prod ... but then :-} [10:20:08] it is not a perfect world [10:20:23] eh hashar, you want to know too much [10:20:24] hahaha [10:20:28] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: don't start cassandra at boot or puppet - https://phabricator.wikimedia.org/T103134#1764923 (10fgiunchedi) we've attempted this yesterday with https://gerrit.wikimedia.org/r/#/c/249374/ to get `base::service_unit` to stop declaring `service` resource (rev... [10:20:36] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint, 5Patch-For-Review: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1764924 (10akosiaris) Let's start by the fact that this has happened exactly one up to now, in the 4 months the service has been active. It was... 
[10:20:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 after maintenance (duration: 00m 18s) [10:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:10] !log restarting Jenkins (java upgrade) [10:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:22:09] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:23:53] (03PS1) 10Giuseppe Lavagetto: maintenance: actually use the ensure parameter in refreshlinks crons [puppet] - 10https://gerrit.wikimedia.org/r/249714 [10:24:13] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249714 (owner: 10Giuseppe Lavagetto) [10:31:07] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1764940 (10daniel) Fwiw, the Wikidata team would benefit quite a lot from Jan having deployment rights. Currently, Aude is the only full-time member of the team who has access (H... [10:31:46] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM but please verify with the compiler" [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [10:32:43] (03PS1) 10Jcrespo: Repool db1065 at 100% load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249716 [10:33:18] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1764941 (10Lydia_Pintscher) +1 to what Daniel said. [10:34:16] <_joe_> !log migrated jobqueue_stats_reporter to mw1152 [10:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:35:03] godog: do you have any varnish hostname I could use to puppet compile the statsd change? 
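The depool/repool steps around db1065 and db1035 above work by editing per-host weights in wmf-config/db-eqiad.php (PHP arrays of host => weight) and syncing the file; a weight of 0 means the host gets no traffic. A toy Python model of weight-proportional server selection; the host names and weights below are illustrative only, not the real configuration:

```python
import random

def pick_server(weights, rng=random):
    """Pick a DB server proportionally to its weight; weight 0 == depooled."""
    pooled = {host: w for host, w in weights.items() if w > 0}
    hosts, ws = zip(*pooled.items())
    return rng.choices(hosts, weights=ws, k=1)[0]

# Illustrative weights: one host at full weight, one reduced, one depooled.
weights = {"db1065": 400, "db1035": 100, "db1083": 0}
```

This also illustrates the mw1083 incident above: a host that keeps an outdated copy of the weights file can still send queries to a server everyone else considers depooled.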
[10:35:20] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:36:54] (03PS2) 10Giuseppe Lavagetto: maintenance: run purge_securepoll once, not every minute for one hour [puppet] - 10https://gerrit.wikimedia.org/r/249701 [10:37:16] hashar: sure, cp1043.eqiad.wmnet [10:38:42] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035 - https://phabricator.wikimedia.org/T107072#1764943 (10jcrespo) [10:38:57] took a random one from codfw as well cp2023.codfw.wmnet [10:39:13] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: run purge_securepoll once, not every minute for one hour [puppet] - 10https://gerrit.wikimedia.org/r/249701 (owner: 10Giuseppe Lavagetto) [10:39:24] (03PS2) 10Jcrespo: Repool db1065 at 100% load. Reduce db1035 weight. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249716 [10:39:43] though I have no idea whether we have a statsd proxy / host in codfw [10:40:12] we do not [10:40:30] (03CR) 10Jcrespo: [C: 032] Repool db1065 at 100% load. Reduce db1035 weight. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249716 (owner: 10Jcrespo) [10:41:25] (03CR) 10Hashar: "Compile result is at https://puppet-compiler.wmflabs.org/1118/" [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [10:41:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 at 100% load. Reduce db1035 weight. (duration: 00m 17s) [10:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:42:12] (03CR) 10Hashar: "There is no hurry for beta-cluster, so feel free to schedule that change later on in case it causes some madness on production. 
That might" [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [10:42:26] godog: all good to me apparently. There is no hurry for beta cluster [10:42:49] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:43:17] godog: an unrelated question: could we get the production grafana to use the labs statsd server as a backend? Would let us have labs and prod dashboards at the same place. Not sure it is doable though [10:44:18] hashar: so re: merging looks like a candidate for puppet swat? re: labs I believe we can, it can be added as another backend [10:45:03] godog: puppet swat probably, will let brandon look at it. I can't attend the puppet swat :( bad timing family-wise and a conflict with another meeting [10:45:24] for Grafana/labs, I will think a bit more about it. I am not sure I actually have a use case :D [10:45:53] besides adding the ability to easily reuse the prod dashboards and point them to the labs graphite so we can share the same boards between prod and beta [10:46:40] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: don't start cassandra at boot or puppet - https://phabricator.wikimedia.org/T103134#1764945 (10fgiunchedi) to recap what we're trying to do: * puppet should install systemd files but don't ensure any stopped/running state on the service itself * we want t...
[10:48:20] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:53:10] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5496 bytes in 0.011 second response time [10:53:40] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5496 bytes in 0.009 second response time [10:54:25] (03PS1) 10Filippo Giunchedi: base: reflow and expand parameters documentation [puppet] - 10https://gerrit.wikimedia.org/r/249717 [10:55:36] (03CR) 10Giuseppe Lavagetto: [C: 032] base: reflow and expand parameters documentation [puppet] - 10https://gerrit.wikimedia.org/r/249717 (owner: 10Filippo Giunchedi) [10:55:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:56:42] (03CR) 10Chmarkine: [C: 031] ssl_ciphersuite: add ECDHE+3DES options [puppet] - 10https://gerrit.wikimedia.org/r/249017 (owner: 10BBlack) [10:59:53] (03PS2) 10Giuseppe Lavagetto: maintenance: make purge_abusefilter run daily, not minutely for 1 hour [puppet] - 10https://gerrit.wikimedia.org/r/249704 [11:00:07] <_joe_> akosiaris: another one ^^ [11:03:50] (03PS2) 10KartikMistry: Apertium: Add missing apertium-br-fr [puppet] - 10https://gerrit.wikimedia.org/r/249376 (https://phabricator.wikimedia.org/T102101) [11:04:47] godog: can you merge simple patch: https://gerrit.wikimedia.org/r/#/c/249376/ ? [11:06:24] kart_: yep [11:06:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Apertium: Add missing apertium-br-fr [puppet] - 10https://gerrit.wikimedia.org/r/249376 (https://phabricator.wikimedia.org/T102101) (owner: 10KartikMistry) [11:07:22] godog: thanks! 
[11:07:47] np [11:08:00] (03PS2) 10Filippo Giunchedi: base: reflow and expand parameters documentation [puppet] - 10https://gerrit.wikimedia.org/r/249717 [11:08:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] base: reflow and expand parameters documentation [puppet] - 10https://gerrit.wikimedia.org/r/249717 (owner: 10Filippo Giunchedi) [11:08:31] (03PS3) 10Giuseppe Lavagetto: RESTBase: Strip redundant headers from back-end services [puppet] - 10https://gerrit.wikimedia.org/r/249465 (https://phabricator.wikimedia.org/T116911) (owner: 10Mobrovac) [11:09:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249465 (https://phabricator.wikimedia.org/T116911) (owner: 10Mobrovac) [11:10:31] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: PID not expanded in heap dumps - https://phabricator.wikimedia.org/T116814#1764970 (10fgiunchedi) change has been merged, pending rolling restart [11:10:41] <_joe_> mobrovac: running puppet on cerium, praseodymium, xenon [11:11:16] kk [11:12:00] <_joe_> viva salt! [11:12:13] <_joe_> cssh ftw, for heaven's sake [11:12:36] <_joe_> mobrovac: done [11:13:03] kk, restarting RB there [11:13:59] Hi ops team, any idea why kafka1018 is feeling bad ? [11:14:02] <_joe_> mobrovac: one thing I don't know is why just a handful of rb hosts do fail [11:14:10] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [11:14:16] yay ^^ [11:14:24] _joe_: fail? 
<_joe_> that health check [11:14:40] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [11:14:43] <_joe_> just a few of them did [11:14:50] <_joe_> but yay recoveries :) [11:14:54] hm [11:16:39] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [11:17:01] godog: i'll soon perform a rolling restart of RB, fyi [11:17:06] <_joe_> joal: I am busy with mobrovac atm [11:17:21] <_joe_> but I will look into kafka afterwards if no one can do it now [11:17:26] _joe_: kk, we can go ahead and force puppet on prod boxes [11:17:31] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [11:17:49] <_joe_> mobrovac: ok 1 sec [11:18:02] kk [11:18:46] mobrovac: ok! [11:18:50] joal: I'll take a look [11:19:01] _joe_: Thanks, trying to understand why 1018 is going mad [11:19:29] <_joe_> thanks godog :) [11:19:50] RECOVERY - Restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [11:20:07] _joe_: Bytes-in spike, don't know why [11:20:31] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: don't start cassandra at boot or puppet - https://phabricator.wikimedia.org/T103134#1764976 (10Joe) @godog I think this is fair. [11:20:39] RECOVERY - Restbase root url on restbase-test2003 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.109 second response time [11:23:26] <_joe_> mobrovac: done [11:23:31] thnx _joe_ [11:23:36] <_joe_> mobrovac: should I restart rb as well?
[11:24:00] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [11:24:03] _joe_: doing it already [11:24:21] !log restbase rolling-restart after config change https://gerrit.wikimedia.org/r/249465 [11:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:25:46] joal: looks like a 2.8B replica lag spike at 9:28, which unit is that btw? bytes? [11:25:49] _joe_: looking at https://grafana.wikimedia.org/dashboard/db/kafka, it seems the lag is reducing [11:25:53] I'm looking at https://grafana.wikimedia.org/dashboard/db/kafka [11:26:19] Seems we are all looking at the same thing :) [11:26:49] RECOVERY - Restbase endpoints health on restbase2001 is OK: All endpoints are healthy [11:26:49] RECOVERY - Restbase endpoints health on restbase2003 is OK: All endpoints are healthy [11:26:50] RECOVERY - Restbase endpoints health on restbase2006 is OK: All endpoints are healthy [11:26:50] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [11:26:50] RECOVERY - Restbase endpoints health on restbase2004 is OK: All endpoints are healthy [11:26:50] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [11:26:50] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [11:26:51] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [11:26:51] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [11:27:18] <_joe_> I was actually looking at ganglia, there is a huge incoming network spike around 9:30 [11:27:25] <_joe_> which didn't stop until now [11:27:49] RECOVERY - Restbase endpoints health on restbase2002 is OK: All endpoints are healthy [11:27:49] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [11:28:09] RECOVERY - Restbase endpoints health on restbase1004 is OK: All 
endpoints are healthy [11:28:09] <_joe_> sadly I was out for the power outage, so I have no backscroll before then [11:28:17] <_joe_> but this is definitely ongoing [11:28:47] The broker log size drop could be something about 1 disk failure ? [11:29:39] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:29:49] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [11:30:05] (03PS1) 10Muehlenhoff: openldap: Abide $datadir [puppet] - 10https://gerrit.wikimedia.org/r/249718 [11:30:16] _joe_, godog : I think the network spike is 1018 catching up on replica lag (using data from 1014 from what I see) [11:30:48] Now, why the replica lag? Broker log size dropped a bit at the given time [11:31:09] So my thought about a disk failure [11:31:11] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:31:19] <_joe_> joal: no disk failure [11:31:23] hmm [11:31:37] <_joe_> joal: http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=kafka1018.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Analytics+Kafka+cluster+eqiad [11:31:41] <_joe_> this is the problem [11:32:05] <_joe_> so something is syphoning a ton of data into kafka [11:32:09] hm, I think this is the result of the problem (1018 catching back up on lag) [11:32:11] akosiaris: puppet fail on sca1002 ^^ [11:32:43] _joe_: https://grafana.wikimedia.org/dashboard/db/kafka?panelId=9&fullscreen [11:32:57] <_joe_> joal: that mostly sounds right [11:33:16] <_joe_> so maybe kafka dropped all of its messages on that machine for some reason [11:33:21] <_joe_> lemme look at the logs from then [11:33:36] I think that's the thing, but why? hm ...
[11:34:33] <_joe_> [2015-10-29 09:28:23,736] 5422066055 [kafka-scheduler-3] INFO kafka.log.OffsetIndex - Deleting index /var/spool/kafka/g/data/webrequest_upload-0/00000000030583928239.index.deleted [11:34:36] thanks _joe_ [11:34:37] <_joe_> [2015-10-29 09:28:23,736] 5422066055 [kafka-scheduler-3] INFO kafka.log.Log - Deleting segment 30616856452 from log webrequest_upload-0. [11:34:41] <_joe_> so yes confirmed [11:34:48] <_joe_> about the why, lemme look again :) [11:35:21] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [11:36:06] indeed, looks like a whole lot of deletes for webrequest_upload [11:36:48] godog: any idea of the reason ? [11:37:20] <_joe_> joal: no idea [11:37:25] ok :) [11:37:28] [2015-10-29 09:27:12,606] 5421994925 [ReplicaFetcherThread-5-14] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-5-14], Replica 18 for partition [webrequest_upload,0] reset its fetch offset from 28493996399 to current lead [11:37:32] At least it seems to be catching back up [11:37:34] er 14's start offset 28493996399 [11:37:35] [2015-10-29 09:27:12,606] 5421994925 [ReplicaFetcherThread-5-14] ERROR kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-5-14], Current offset 31062784634 for partition [webrequest_upload,0] out of range; reset offset to 284939963 [11:37:36] <_joe_> I'm looking at the logs and all of a sudden... that ^^ [11:37:39] 99 [11:38:16] <_joe_> it's completely out of the blue, so... 
no real root cause [11:38:52] ok lads, we'll discuss that with ottomata when he arrives (better at kafka than I am :) [11:38:59] Thanks a lot for the support [11:40:47] yeah can't figure out why it'd do that ATM [11:41:34] <_joe_> ok I'm off for a few [11:42:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [11:47:37] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1765001 (10mobrovac) [11:47:38] 6operations, 10RESTBase, 6Services, 5Patch-For-Review: restbase endpoint reporting incorrect content-encoding: gzip - https://phabricator.wikimedia.org/T116911#1764999 (10mobrovac) 5Open>3Resolved Deployed, fixed: ``` icinga-wm: RECOVERY - Restbase endpoints health on restbase2001 is OK: All endpoints... [11:47:49] 6operations, 10RESTBase, 6Services: restbase endpoint reporting incorrect content-encoding: gzip - https://phabricator.wikimedia.org/T116911#1765002 (10mobrovac) [11:48:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [11:49:31] (03PS1) 10Muehlenhoff: openldap: Use require_package [puppet] - 10https://gerrit.wikimedia.org/r/249720 [11:55:49] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:01:39] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:10:59] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:12:02] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765022 (10JohnLewis) 3NEW [12:13:06] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765030 (10jcrespo) I am
not ok with this task. [12:16:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:19:20] (03PS1) 10KartikMistry: Set ContentTranslationCXServerAuth for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249721 [12:20:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:22:01] Any deployer around for hotfix? [12:24:30] (03PS2) 10KartikMistry: Set ContentTranslationCXServerAuth for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249721 [12:24:42] No one I guess. [12:25:41] (03PS1) 10Giuseppe Lavagetto: Revert "Apertium: Add missing apertium-br-fr" [puppet] - 10https://gerrit.wikimedia.org/r/249723 [12:25:51] kart_: What's up? [12:26:14] <_joe_> kart_: I'm reverting this, it is causing puppet to fail [12:26:34] _joe_: no worry. [12:26:52] _joe_: package not installable? [12:27:00] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "This is making puppet fail on sca*" [puppet] - 10https://gerrit.wikimedia.org/r/249723 (owner: 10Giuseppe Lavagetto) [12:27:07] (03PS2) 10Giuseppe Lavagetto: Revert "Apertium: Add missing apertium-br-fr" [puppet] - 10https://gerrit.wikimedia.org/r/249723 [12:27:15] hoo: no worry, I'll deploy myself. [12:27:18] (03CR) 10Giuseppe Lavagetto: [V: 032] Revert "Apertium: Add missing apertium-br-fr" [puppet] - 10https://gerrit.wikimedia.org/r/249723 (owner: 10Giuseppe Lavagetto) [12:27:30] ok [12:27:46] <_joe_> kart_: yes [12:27:52] <_joe_> not found, rather [12:28:00] 6operations, 10Wikidata, 7Database, 7Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1765071 (10jcrespo) I do not see this happening on enwiki. Checking on other wikis/hosts. [12:30:23] _joe_: it looks alex forgot to upload it. [12:30:37] anyway, not urgent. 
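Circling back to the kafka1018 incident from earlier: the two offsets in the 11:37 ReplicaFetcherThread messages give a rough size for the forced re-replication. A quick shell-arithmetic sketch (a rough estimate only; how much actually had to be re-fetched also depends on the leader's log-end offset, which isn't in the quoted log):

```shell
# Offsets copied from the kafka.server.ReplicaFetcherThread lines at 11:37.
old_offset=31062784634   # kafka1018's fetch position, declared out of range
reset_to=28493996399     # leader (broker 14) log start offset it reset to

# After the reset, the follower re-fetches from the leader's log start,
# which is what produced the inbound network spike and the lag alerts.
echo "offset rewound by $((old_offset - reset_to)) messages"
# prints: offset rewound by 2568788235 messages
```

That is roughly 2.57 billion messages rewound on webrequest_upload partition 0 alone, which lines up with the ~2.8B replica-lag spike godog noted at 11:25.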
[12:30:53] (03CR) 10KartikMistry: [C: 032] Set ContentTranslationCXServerAuth for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249721 (owner: 10KartikMistry) [12:31:16] (03Merged) 10jenkins-bot: Set ContentTranslationCXServerAuth for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249721 (owner: 10KartikMistry) [12:33:23] !log kartik@tin Synchronized wmf-config/CommonSettings.php: Set ContentTranslationCXServerAuth for CX (duration: 00m 17s) [12:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:34:49] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:35:21] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:41:00] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:48:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:54:00] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:55:30] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:59:49] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [13:01:44] (03PS1) 10KartikMistry: CX: Fix ContentTranslationCXServerAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249728 [13:01:50] (03CR) 10jenkins-bot: [V: 04-1] CX: Fix ContentTranslationCXServerAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249728 (owner: 10KartikMistry) [13:04:23] (03PS2) 10KartikMistry: CX: Fix ContentTranslationCXServerAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249728 [13:04:30] (03CR) 
10jenkins-bot: [V: 04-1] CX: Fix ContentTranslationCXServerAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249728 (owner: 10KartikMistry) [13:06:41] Keep calm, Kartik, Keep calm! [13:06:49] (03PS3) 10KartikMistry: CX: Fix ContentTranslationCXServerAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249728 [13:11:12] (03PS1) 10Hoo man: Use pbzip2 -p3 to compress Wikidata JSON dumps on snapshot1003 [puppet] - 10https://gerrit.wikimedia.org/r/249729 [13:14:16] !log kartik@tin Synchronized private/PrivateSettings.php: Fix name of JWT token (duration: 00m 18s) [13:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [13:15:33] (03CR) 10KartikMistry: [C: 032] "hotfix." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249728 (owner: 10KartikMistry) [13:15:40] (03Merged) 10jenkins-bot: CX: Fix ContentTranslationCXServerAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249728 (owner: 10KartikMistry) [13:17:08] !log kartik@tin Synchronized wmf-config/CommonSettings.php: Fix ContentTranslationCXServerAuth for CX (duration: 00m 18s) [13:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [13:24:43] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765162 (10Krenair) Why? [13:27:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [13:28:01] HMMM [13:32:45] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765164 (10jcrespo) This task is not blocking anything, so it is not quickly actuable. 
By the time it is needed, tendril may have been converted into something else (grafana + scripts). It handles *very private data*, it... [13:35:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [13:38:12] ACKNOWLEDGEMENT - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] ottomata This is strange indeed. Everything is ok, and this node is catching back up. But Im not sure why it started lagging in the first place. [13:39:52] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765167 (10Halfak) > I think understanding the semantics of an event primarily requires knowledge of the topic. This is true if you are consuming fr... [13:43:00] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765168 (10JohnLewis) >>! In T117034#1765164, @jcrespo wrote: > It handles *very private data*, it should be part of the production network. I think there is not infrastructure there yet handling VMs. Because of firewall a... [13:47:20] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: puppet fail [13:52:10] 6operations, 7Database: Replicate flowdb from X1 to analytics-store - https://phabricator.wikimedia.org/T75047#1765187 (10jcrespo) a:5Springle>3jcrespo [13:54:02] 6operations, 7Database: Replicate flowdb from X1 to analytics-store - https://phabricator.wikimedia.org/T75047#760294 (10jcrespo) I am trying this now with the resources we have- I cannot guarantee how well it work, but I cannot continue blocking this for so long :-/ [13:56:23] what is http://deployment-urldownloader.eqiad.wmflabs:8080 in Production? [13:58:11] kart_: You mean the URL? http://url-downloader.wikimedia.org:8080 [13:58:51] Reedy: thanks! 
[13:59:05] You can use $wgCopyUploadProxy too [13:59:18] (should?) [13:59:20] but that has no protocol set [14:05:57] (03PS1) 10KartikMistry: CX: Add Proxy for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/249735 [14:06:41] godog: for you another patch :) ^ [14:09:26] <_joe_> dcausse: I have gone late with the interview [14:09:28] <_joe_> my bad [14:15:01] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:21:40] heya apergos, yt? [14:22:03] yes, what's up, ottomata? [14:22:27] so, got a new cron job on dumps that runs as datasets [14:22:42] when I run manually in shell, is fine. syslog shows that the job is run as scheduled [14:22:47] which one is it? [14:22:57] # Puppet Name: dataset-pageview [14:23:00] 51 * * * * /usr/bin/rsync -rt --delete --chmod=go-w stat1002.eqiad.wmnet::hdfs-archive/{pageview,projectview}/legacy/hourly/ /data/xmldatadumps/public/other/pageviews/ [14:23:23] but, no new data is copied over hourly, i just ran it manually and it did what I'd expect [14:23:25] runs on datasets host right? [14:23:27] yes [14:23:32] well let's see [14:23:33] do you get emails to ops-dumps@wikimedia.org [14:23:34] ? [14:23:37] yes indeed [14:23:42] maybe something in there? [14:23:47] gona check that right now [14:23:49] k danke [14:24:31] rsync: change_dir "/{pageview,projectview}/legacy/hourly" (in hdfs-archive) failed: No such file or directory [14:24:37] hm! [14:24:47] weird! it works fine on CLI [14:24:48] that is strange [14:24:53] different shell maybe [14:24:57] hmmmMMMM [14:24:58] yeah probably sh [14:25:00] will check that [14:25:03] great [14:25:27] yes, that is it [14:25:28] thank you. [14:25:31] sure!
[14:25:33] will just make two jobs [14:25:38] for each of those dirs [14:27:14] hmm, or, i think bash -c will work [14:30:07] (03PS1) 10Jcrespo: Addind needed files to setup research access to a flow replica [puppet] - 10https://gerrit.wikimedia.org/r/249743 (https://phabricator.wikimedia.org/T75047) [14:30:42] (03PS1) 10Ottomata: dataset::cron::job now runs rsync command via bash instead of sh [puppet] - 10https://gerrit.wikimedia.org/r/249744 [14:31:02] (03PS2) 10Jcrespo: Addind needed files to setup research access to a flow replica [puppet] - 10https://gerrit.wikimedia.org/r/249743 (https://phabricator.wikimedia.org/T75047) [14:31:48] (03CR) 10Ottomata: [C: 032] dataset::cron::job now runs rsync command via bash instead of sh [puppet] - 10https://gerrit.wikimedia.org/r/249744 (owner: 10Ottomata) [14:32:26] (03PS3) 10Jcrespo: Addind needed files to setup research access to a flow replica [puppet] - 10https://gerrit.wikimedia.org/r/249743 (https://phabricator.wikimedia.org/T75047) [14:36:00] Who can review: https://gerrit.wikimedia.org/r/#/c/249735/ ? [14:37:57] kart_, I'm not sure that will work [14:38:14] Doesn't that need to be cxserver::proxy instead of just 'proxy'? [14:39:13] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765314 (10jcrespo) > Things can't be rejected I have not rejected it, I said I do not agree with it and that I am not going to work on it. [14:41:48] jynus: if I wanted you to work on it, I would have assigned it to you :) [14:42:27] Krenair: I see. Checking. [14:42:38] Krenair: see logstash. I put it same. [14:42:49] who is going to work on it, JohnFLewis ? [14:43:05] Krenair: it is cxserver.yaml, so should work. [14:43:49] Whoever picks it up. It could also naturally fall into volunteer ground and an opsen just merges and does the server side stuff necessary [14:44:05] where am I looking for this logstash entry kart_? [14:44:13] Krenair: same file.
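The cron failure diagnosed above comes down to brace expansion: cron runs commands through /bin/sh (dash on Debian), which does not expand {pageview,projectview} and so handed rsync the literal path, while testing on the CLI used bash, which does expand it. A minimal reproduction, with paths shortened for illustration:

```shell
# bash expands the braces into two separate rsync source arguments:
bash -c 'echo hdfs-archive/{pageview,projectview}/legacy/hourly/'
# prints: hdfs-archive/pageview/legacy/hourly/ hdfs-archive/projectview/legacy/hourly/

# dash (cron's /bin/sh on Debian) passes the braces through literally, giving
# rsync the nonexistent path "/{pageview,projectview}/legacy/hourly" — exactly
# the error mailed to ops-dumps. One way to apply the bash -c fix mentioned
# above would be a crontab entry along the lines of (illustrative):
#   51 * * * * /bin/bash -c '/usr/bin/rsync -rt --delete SRC::hdfs-archive/{pageview,projectview}/legacy/hourly/ /data/xmldatadumps/public/other/pageviews/'
```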
[14:44:16] ok, I've just stated my opinion [14:44:26] hieradata/common/cxserver.yaml [14:44:45] If someone assign is to me, I will close it as invalid, ok? [14:44:56] *asigns it [14:45:22] no invalid, declined [14:45:33] jynus: or just u assign yourself and say anyone can do it. Not really something that you need to do, you just need to be aware when it happens [14:46:08] Looking at it, apply role to a machine, change db grants and DNS is all that's really necessary. As long as the new box can talk to the db backend all should be fine? [14:46:17] (03PS1) 10BBlack: More efficient capture->processing for cipher_sim on many machines... [puppet] - 10https://gerrit.wikimedia.org/r/249745 [14:46:30] (03PS1) 10Ottomata: Deploy varnishreqstats on all text caches [puppet] - 10https://gerrit.wikimedia.org/r/249746 (https://phabricator.wikimedia.org/T83580) [14:46:44] JohnFLewis, you say so- but you do not suffer its downtime [14:46:55] (03CR) 10jenkins-bot: [V: 04-1] More efficient capture->processing for cipher_sim on many machines... [puppet] - 10https://gerrit.wikimedia.org/r/249745 (owner: 10BBlack) [14:47:04] I do not want to continue discussing that [14:47:24] jynus: would there realistically be downtime fallout though? Since the backend is unified, as long as both boxes can communicate shouldn't be an issue [14:47:27] (03CR) 10Ottomata: [C: 032] Deploy varnishreqstats on all text caches [puppet] - 10https://gerrit.wikimedia.org/r/249746 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [14:47:33] Okay, if you don't want to :) [14:58:12] ottomata: btw I was looking at your otto stats in grafana: it seems like cp1065 is logging zero 5xx so far? that seems probably not-right [14:59:22] bblack, looking back at the last 24 hours [14:59:30] i see some 5xxes on eqiad.text [15:00:05] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come.
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151029T1500). [15:01:21] bblack [15:01:22] https://graphite.wikimedia.org/render/?width=588&height=311&target=varnish.eqiad.text.frontend.request.client.status.5xx.rate [15:03:20] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [15:03:27] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765348 (10Ottomata) > @Ottomata, I think understanding the semantics of an event primarily requires knowledge of the topic. Hm, I don't think this... [15:07:31] 6operations, 10Beta-Cluster-Infrastructure, 7WorkType-NewFunctionality: etcd/confd is not started on deployment-cache-mobile04 - https://phabricator.wikimedia.org/T116224#1765358 (10hashar) ``` deployment-cache-text04:/etc# cat /lib/systemd/system/confd.service [Unit] Description=confd [Service] User=root E... 
[15:10:56] (03PS1) 10Faidon Liambotis: Add mr1-codfw OOB IP [dns] - 10https://gerrit.wikimedia.org/r/249748 (https://phabricator.wikimedia.org/T116694) [15:11:21] (03CR) 10Faidon Liambotis: [C: 032] Add mr1-codfw OOB IP [dns] - 10https://gerrit.wikimedia.org/r/249748 (https://phabricator.wikimedia.org/T116694) (owner: 10Faidon Liambotis) [15:13:39] (03PS4) 10Jcrespo: Addind needed files to setup research access to a flow replica [puppet] - 10https://gerrit.wikimedia.org/r/249743 (https://phabricator.wikimedia.org/T75047) [15:15:42] !log removed openjdk-7 on restbase100[1-9] (it's using openjdk-8 for a while) [15:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:20] (03PS5) 10Jcrespo: Addind needed files to setup research access to a flow replica [puppet] - 10https://gerrit.wikimedia.org/r/249743 (https://phabricator.wikimedia.org/T75047) [15:16:54] (03CR) 10Jcrespo: [C: 032] Addind needed files to setup research access to a flow replica [puppet] - 10https://gerrit.wikimedia.org/r/249743 (https://phabricator.wikimedia.org/T75047) (owner: 10Jcrespo) [15:19:45] 6operations, 7Database, 5Patch-For-Review: Replicate flowdb from X1 to analytics-store - https://phabricator.wikimedia.org/T75047#1765407 (10jcrespo) I've added flowdb to analytics-store. I cannot guarantee how well it will work, x1 traffic is very "particular". Let's assume it is in "beta", and you can giv... [15:20:10] (03CR) 10BryanDavis: [C: 031] "This would fix the mtime setting problem that master-master rsync has. 
I don't think it makes anything fundamentally less secure as this u" [puppet] - 10https://gerrit.wikimedia.org/r/249684 (https://phabricator.wikimedia.org/T117016) (owner: 10Alex Monk) [15:22:08] (03CR) 10Alex Monk: "see task, this is cherry-picked on deployment-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/249684 (https://phabricator.wikimedia.org/T117016) (owner: 10Alex Monk) [15:25:17] <_joe_> bd808: so, multi-master scap, do you need assistance from your not-so-friendly opsen? [15:25:19] (03PS2) 10KartikMistry: CX: Add Proxy for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/249735 [15:27:31] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, 3labs-sprint-117: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1765434 (10Andrew) [15:27:42] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, 3labs-sprint-117: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1765435 (10Andrew) [15:28:42] 6operations, 7Database: Check, test and tune pool-of-connections and max_connections configuration - https://phabricator.wikimedia.org/T112479#1765436 (10jcrespo) 5Open>3Resolved This requires long-term monitoring, but I have found no issues so far. Closing as resolved for now, will reopen if issues are ac... [15:29:38] (03PS2) 10Jcrespo: Set MariaDB 10 as the default version when using WMF packages [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244657 [15:31:05] (03CR) 10Alexandros Kosiaris: [C: 031] CX: Add Proxy for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/249735 (owner: 10KartikMistry) [15:31:21] (03CR) 10Jcrespo: "@ottomata, does this change affect you?" 
[puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244657 (owner: 10Jcrespo) [15:32:26] (03CR) 10Ottomata: "Looks like the one place I'm using it is already setting this to true:" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244657 (owner: 10Jcrespo) [15:32:28] (03CR) 10Hashar: [C: 031] Make mediawiki-config clone be owned by mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/249684 (https://phabricator.wikimedia.org/T117016) (owner: 10Alex Monk) [15:33:09] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, 3labs-sprint-117: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1765459 (10faidon) a:5mark>3chasemp `17:28 < chasemp> I would like to outline it and take care of it` [15:33:26] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, 3labs-sprint-117: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1765464 (10faidon) a:5mark>3chasemp [15:34:19] 10Ops-Access-Requests, 6operations: Add Matanya to "restricted" to perform server side uploads - https://phabricator.wikimedia.org/T106447#1765469 (10mark) 5stalled>3declined Alright, since I do need to give an update: I'm going to reject it for this purpose. We really should not have to provide shell acce... 
[15:36:10] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765473 (10chasemp) p:5Triage>3Lowest [15:37:11] (03PS3) 10KartikMistry: CX: Add proxy and yandex_url for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/249735 [15:37:47] 6operations, 6Analytics-Engineering: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1765475 (10chasemp) p:5Triage>3Normal [15:37:57] (03PS2) 10BryanDavis: Make mediawiki-config clone be owned by mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/249684 (https://phabricator.wikimedia.org/T117016) (owner: 10Alex Monk) [15:38:54] 6operations, 6Analytics-Engineering: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1765481 (10chasemp) a:3Ottomata @ottomata I am going to toss your way as even though we OK on the check for the moment you have the best idea of whether this needs longer term attention or not [15:39:06] 6operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1765483 (10chasemp) p:5Triage>3Normal [15:40:42] (03PS2) 10Giuseppe Lavagetto: maintenance: move purge_securepoll off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249702 (https://phabricator.wikimedia.org/T116728) [15:40:44] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765493 (10GWicke) @ottomata: Based on our backwards-compatibility rules, the latest schema will be a superset of previous schemas. This means that y... [15:41:16] 6operations, 6Analytics-Engineering: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1765495 (10Ottomata) 5Open>3Resolved Looks like all is well: /dev/mapper/tank-home 1008G 561G 448G 56% /home Thanks! 
[15:41:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249702 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [15:42:26] (03PS4) 10KartikMistry: CX: Add proxy and yandex_url for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/249735 [15:43:08] <_joe_> !log moving purge_securepoll from terbium to mw1152 [15:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:13] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move purge_checkuser off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249703 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [15:45:18] (03PS2) 10Giuseppe Lavagetto: maintenance: move purge_checkuser off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249703 (https://phabricator.wikimedia.org/T116728) [15:48:49] <_joe_> !log moving purge_checkuser from terbium to mw1152 [15:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:33] (03CR) 10Andrew Bogott: [C: 032] "Puppet compiler also approves." [puppet] - 10https://gerrit.wikimedia.org/r/249678 (owner: 10Dzahn) [15:57:41] (03PS2) 10Andrew Bogott: labstore: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249678 (owner: 10Dzahn) [15:58:51] (03CR) 10Andrew Bogott: [C: 032] labstore: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249678 (owner: 10Dzahn) [16:00:05] Coren jynus: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151029T1600). [16:00:05] kart_: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:16] yep [16:02:04] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: admin script to do cleanup, enter maintenance mode, etc [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/234971 (owner: 10ArielGlenn) [16:02:49] Who is Puppet SWAT'ng? [16:03:01] Coren: jynus ? [16:03:20] I am checking, kart_ [16:03:34] It is trivial but want to check it in any case [16:03:44] Sure! [16:03:51] (03PS1) 10ArielGlenn: dumpadmin script: add "rerun" which reruns a broken job [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/249754 [16:05:38] (03PS1) 10BryanDavis: logstash: Add a "synced flush" command when optimizing an index [puppet] - 10https://gerrit.wikimedia.org/r/249756 [16:06:51] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765592 (10Ottomata) Have we decided that defaults will be filled in for missing fields? [16:06:56] you chose... wisely: https://puppet-compiler.wmflabs.org/1122/ [16:08:58] kart_: Either/or. [16:09:12] kart_: Sorry, forgot the time matched my lunch - just back with food. [16:09:24] Coren: jynus is taking care. [16:09:48] Ah, just the one patch. [16:10:29] (03PS5) 10Jcrespo: CX: Add proxy and yandex_url for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/249735 (owner: 10KartikMistry) [16:11:01] Coren, see my last jenkins job [16:11:22] (03CR) 10Jcrespo: [C: 032] CX: Add proxy and yandex_url for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/249735 (owner: 10KartikMistry) [16:11:28] (03CR) 10Ori.livneh: [C: 031] "yes please" [puppet] - 10https://gerrit.wikimedia.org/r/249704 (owner: 10Giuseppe Lavagetto) [16:11:45] (03PS3) 10Giuseppe Lavagetto: maintenance: make purge_abusefilter run daily, not minutely for 1 hour [puppet] - 10https://gerrit.wikimedia.org/r/249704 [16:11:58] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249704 (owner: 10Giuseppe Lavagetto) [16:12:20] jynus: go ahead. 
[16:12:26] oh. you did. [16:12:37] (03CR) 10Ori.livneh: [C: 031] maintenance: move purge_abusefilter off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249705 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [16:13:08] <_joe_> jynus: can I merge both patches? [16:13:18] yes [16:13:21] <_joe_> mine and yours? [16:13:24] both [16:13:29] <_joe_> ok [16:13:29] (03PS2) 10Rush: logstash: Add a "synced flush" command when optimizing an index [puppet] - 10https://gerrit.wikimedia.org/r/249756 (owner: 10BryanDavis) [16:13:32] I do not know yours [16:13:37] but mine, yes [16:13:39] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765600 (10GWicke) @ottomata, they will be filled in somewhere, but I think we haven't necessarily decided on filling them in at production time. To... [16:13:39] :-) [16:13:46] (03CR) 10Rush: [C: 032 V: 032] logstash: Add a "synced flush" command when optimizing an index [puppet] - 10https://gerrit.wikimedia.org/r/249756 (owner: 10BryanDavis) [16:14:04] sorry for being so slow [16:15:00] I continuously lose my terminals and browser tabs [16:15:09] <_joe_> jynus: np, happens to everyone :) [16:15:22] <_joe_> I have like 10 phab tabs and 15 gerrit ones atm [16:15:51] jynus: how much time it will take to see changes live? 
[16:15:54] and lately, I have to look at IRC to get some order of the things I did and didn't [16:16:05] kart_, I am logging right now to the hosts [16:16:09] give me 1 sec [16:16:33] (03PS2) 10Giuseppe Lavagetto: maintenance: move purge_abusefilter off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249705 (https://phabricator.wikimedia.org/T116728) [16:16:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249705 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [16:17:21] (03PS1) 10EBernhardson: Add cirrussearch to dumps.wikimedia.org/other html page [puppet] - 10https://gerrit.wikimedia.org/r/249761 (https://phabricator.wikimedia.org/T109690) [16:17:35] you can already test sca1001 [16:17:44] kart_^ [16:17:57] cxserver reloaded [16:18:30] okay. Thanks! [16:18:41] everything looks ok? [16:19:02] <_joe_> !log moved purge_abusefilter from terbium to mw1152 [16:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:53] (03PS1) 10Luke081515: Add reupload-shared right to autoconfirmed users (ruwikivoyage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249762 (https://phabricator.wikimedia.org/T116575) [16:21:11] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765626 (10Ottomata) Producer A has schema version 1. Producer B has schema version 2, which has added field "name" with default "nonya". All of t... 
[16:22:01] I will move the replication job from terbium to dbstore1002 and db1047, puppetize it and set up monitoring [16:22:11] but probably not this week [16:22:32] !log removed obsolete mysql 5.5 packages on mw102[2-9], mw1032, mw1053, mw1114, mw1163 [16:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:35] <_joe_> jynus: yeah not a problem, once I am done moving the scripts I can simply uninstall mediawiki from terbium and leave it around for as long as you need [16:24:48] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765642 (10GWicke) @ottomata, you are basically making the case for filling in the defaults at consumption time. [16:25:23] no, _joe_ I need to do that, and that was a good reason [16:25:27] :-) [16:26:12] I will also salt-discover springle screen sessions [16:27:07] ottomata: I meant I don't ever see avg/current (or graph hover data) in https://grafana-admin.wikimedia.org/dashboard/db/reqstats-otto ever showing text 5xx [16:27:30] vs https://gdash.wikimedia.org/dashboards/reqerror/ (which includes upload of course, which might be the major source) [16:28:03] but all zeros in reqstats-otto AFAICS [16:28:15] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244657 (owner: 10Jcrespo) [16:30:24] (03PS2) 10BBlack: More efficient capture->processing for cipher_sim on many machines... [puppet] - 10https://gerrit.wikimedia.org/r/249745 [16:31:43] (03CR) 10BBlack: [C: 032] More efficient capture->processing for cipher_sim on many machines... [puppet] - 10https://gerrit.wikimedia.org/r/249745 (owner: 10BBlack) [16:36:05] could kafka lag issues have something to do with that? 
[16:36:14] (03PS1) 10BBlack: standard packages: add tshark (non-gui wireshark) [puppet] - 10https://gerrit.wikimedia.org/r/249765 [16:36:55] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1765669 (10RobH) a:5RobH>3mark With @ArielGlenn's update, I've reassigned this back to @mark for his review. [16:37:11] (03CR) 10Jcrespo: "I agree, I will review all uses and change it overall." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244657 (owner: 10Jcrespo) [16:39:26] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765677 (10Ottomata) Or produce time. But really, even if we fill in defaults during production or consumption, this will still be a problem for his... [16:44:43] etherpad is dead :( [16:45:29] wait [16:45:35] i've been curious about it [16:45:39] the cause of the crashes [16:45:47] instead of restarting could you possibly make do without it for a few minutes? [16:45:56] s/crashes/freezes [16:45:58] it's trying to come back for me right now, fwiw [16:46:12] I no longer get the WMF 503 page [16:46:34] ori: it's probably the RPG players again [16:47:01] :( [16:48:49] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [16:49:40] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [16:50:03] that is mr BB's [16:51:27] ori: flapping again [16:51:30] for me [16:52:51] is there a command line hhvm available on tin or somewhere? [16:54:48] ori: greg-g imma restart [16:57:10] YuviPanda: kk [16:57:26] Nikerabbit: not yet, but soon. you can run hhvm on other app servers, though. [16:57:32] !log restart etherpad on etherpad1001 [16:57:35] greg-g: better? 
[16:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:53] Nikerabbit: try: SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mw1017 [16:57:55] from tin [16:58:41] PROBLEM - puppet last run on mw1074 is CRITICAL: CRITICAL: Puppet has 1 failures [16:59:01] jynus: yeah that's me, but it's totally not an important commit, which is why I didn't think to merge :) [16:59:30] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 1 failures [17:00:00] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:00:25] looking at 503s is upload with the regular (not issue) with long thumbnail names [17:01:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:02:08] YuviPanda: I think, we're ending our meeting now :) [17:04:30] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 1 failures [17:05:17] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765766 (10Dzahn) Hey, nobody ever said "labs". This has always just been about a move within the production network from one host to another. He was just trying to find something to move off of neon because neon is kind... [17:09:07] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765782 (10jcrespo) I have not rejected it, I said I do not agree with it and that I am not going to work on it. Good luck convincing someone working on it, and even if you do, convincing me that this will not create downt... 
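The one-off-command pattern ori pastes above (an hhvm-capable shell on an app server, reached from tin via keyholder's filtered agent socket) can be sketched as below. The socket path, the mwdeploy user, and mw1017 come from the log; the `remote_cmd` wrapper, its dry-run echo, and the `--wiki=enwiki` argument are hypothetical illustration only:

```shell
# Sketch: run a one-off command on an app server from tin via the
# keyholder agent proxy. Pointing SSH_AUTH_SOCK at keyholder's filtered
# socket lets ssh use the shared mwdeploy key without ever exposing the
# private key to the user. remote_cmd only echoes the command (dry run).
KEYHOLDER_SOCK=/run/keyholder/proxy.sock

remote_cmd() {
    # $1 = target host; remaining args = command to run there.
    host=$1
    shift
    echo "SSH_AUTH_SOCK=$KEYHOLDER_SOCK ssh mwdeploy@$host $*"
}

# Print the command that would open an eval.php shell on mw1017:
remote_cmd mw1017 mwscript eval.php --wiki=enwiki
```

Running the printed line on tin would give the interactive `mwscript eval.php` session mentioned later in the log; the dry-run wrapper just makes the construction visible.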
[17:10:10] !log kartik@tin Synchronized private/PrivateSettings.php: Retry syncing for CX token (duration: 00m 17s) [17:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:35] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1765791 (10RobH) [17:11:46] 6operations, 10Traffic, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1765793 (10RobH) [17:17:29] !log kartik@tin Synchronized wmf-config/PrivateSettings.php: Really sync right file this time (duration: 00m 17s) [17:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:07] 6operations, 7Database: Move tendril to a VM - https://phabricator.wikimedia.org/T117034#1765806 (10ori) 5Open>3declined a:3ori We have **one** DBA. However much Neon is overloaded and is a SPOF, it is less overloaded and less of a SPOF than Jaime is. The question we should be asking is how to move tasks... [17:18:57] ori: clarify? [17:19:20] I added a comment; what isn't clear? [17:20:22] ori: you closed it stating Jaime has no time yet we agreed it's not a Jaime task [17:20:36] why not keep it open and lowest prio if the actual request isn't controversial? [17:20:39] > Good luck convincing someone working on it, and even if you do, convincing me that this will not create downtime so I will have to block it. [17:20:40] it was never meant to be a task assigned to him [17:20:49] it's moving a puppet role [17:20:53] i.e., even if someone else takes it on, there is a potential for disruption [17:20:55] yeah, it seemed like an overreaction [17:21:09] No disruption at all [17:21:22] according to whom? [17:21:38] Firstly no one has said why it creates disruption [17:21:52] it can run on multiple hosts at the same time. 
i don't see any [17:22:28] because that would double the queries it sends to servers to get their status [17:22:31] The most disruption would be a DNS change which even then won't be disruptive as regardless of where it resolves, it's still a working tendril install as all data is in the database backend [17:22:47] as long as it's added/checked/dns update/removed from the old one in that order it should be fine in theory [17:23:01] RECOVERY - puppet last run on mw1074 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:23:07] * AaronSchulz looks at tendril a lot, I guess I might notice if it breaks ;) [17:23:17] But that's not disruption as he was meaning. He meant downtime. [17:23:23] i just think the whole task demonstrates a fundamentally flawed take on which resources are actually scarce [17:23:45] What resources are scarce in terms of that task? [17:23:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [17:23:50] "we're going to do shit to some tool you depend on" is going to steal attention [17:23:58] even if it's just to sanity-check [17:23:59] ori: there's no reason to decline it though, it can stay open and lowest prio without actually making any DBA's life worse [17:24:03] and it's a stressor [17:24:07] so just move it away for now [17:24:33] afaict, at least [17:24:46] wth..it was never assigned to anyone. it's a valid task. he is trying to help us get load off of neon. if we have no time for it now ok, we get it. but saying "good luck" and declined.. really? [17:25:22] i said re-open it later [17:25:40] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:25:41] without explaining why an open/lowest prio task hurts anyone :) [17:25:56] Nothing changes if we re-open it in a month or keep it open now. 
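The cutover order proposed in the discussion above (add the new host, check it, update DNS, then remove the old one, relying on tendril keeping all state in its database backend so both hosts can serve at once) can be written out as an explicit checklist. The `plan` helper and its wording are invented for illustration; no real hostnames or tooling are implied:

```shell
# Dry-run checklist for the tendril host move sketched in the chat.
# Each step is just echoed; nothing is executed against real hosts.
plan() {
    echo "1. apply the tendril puppet role on the new host (old host still live)"
    echo "2. check the new host's web UI against the shared database backend"
    echo "3. update DNS to point tendril at the new host"
    echo "4. wait out the DNS TTL and confirm traffic has moved"
    echo "5. remove the role from, and clean up, the old host"
}

plan
```

The point made in the chat is that because the web frontend is stateless relative to the database, steps 1 through 3 can overlap without downtime.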
[17:26:11] because the low-level chatter is distracting [17:26:18] umm, can someone help us to fix the "Undefined variable" notice visible in fatalmonitor? as far as I can see kart_ has synced the right file. [17:26:33] I was actually going to submit patches for this either tomorrow or at the weekend. All that close does is say 'thanks for the offer, but no' [17:26:40] just not acting on it makes more sense if the reasoning is we are all too busy [17:26:44] ori: I think the contention is that that chatter you reference was misguided and misunderstood by jaime [17:26:59] because you shouldn't do this now [17:27:14] even if you own the process end-to-end, this is still going to require a sanity check from jaime, and he has other things to worry about [17:27:19] Patches remain relevant forever though. I'm not forcing people to merge them [17:27:27] so just don't, keep the radio channel clear for important communication [17:27:46] Nikerabbit: what error? wmgContentTranslationCXServerAuthKey ? [17:27:54] that's not really a fair way of dealing with noise, but whatever [17:27:58] AaronSchulz: yeah [17:28:02] 6operations, 10Traffic, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1765848 (10RobH) w.wiki is planned to be part of our next unified certificate as a SAN. The order of this certificate is pending. [17:28:31] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:30:36] no, not cool [17:32:15] !log aaron@tin Synchronized private/PrivateSettings.php: (no message) (duration: 00m 17s) [17:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:07] 6operations, 7Monitoring: non sms alternatives - https://phabricator.wikimedia.org/T114651#1765855 (10RobH) p:5High>3Low I find data to work less reliably than SMS in intermittent coverage. SMS messages require less connectivity overall than data in my experience. 
That being said, we don't have near conc... [17:33:17] !log aaron@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 18s) [17:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:23] AaronSchulz: what was that ^? please leave a message when deploying :) [17:35:47] trying to stop wmgContentTranslationCXServerAuthKey errors (see backscroll) [17:35:54] ahh [17:35:59] very mysterious, as it works on tin [17:36:05] just 'touch' of IS.php [17:36:13] * greg-g hasn't been keeping up on backscroll lately, stupid physical bodies breaking down [17:36:38] !log Did touch of InitialiseSettings.php [17:36:41] :) [17:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:15] still happening :( [17:37:50] PROBLEM - DPKG on cp2025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:37:56] if it's every request the log channel could be backlogged [17:37:59] PROBLEM - DPKG on cp2020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:00] PROBLEM - DPKG on cp4003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:00] ^ that's me (dpkg on cp*) [17:38:00] PROBLEM - DPKG on cp4002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:00] PROBLEM - DPKG on cp4017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:00] PROBLEM - DPKG on cp4016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:00] PROBLEM - DPKG on cp4019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:01] and the errors could be stale [17:38:01] PROBLEM - DPKG on cp1054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:01] PROBLEM - DPKG on cp4015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:02] PROBLEM - DPKG on cp4009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:02] PROBLEM - DPKG on cp4018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:03] PROBLEM - DPKG on cp4006 is 
CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:03] PROBLEM - DPKG on cp4014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:04] PROBLEM - DPKG on cp2022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:04] PROBLEM - DPKG on cp2023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:10] PROBLEM - DPKG on cp2021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:10] PROBLEM - DPKG on cp3035 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:10] PROBLEM - DPKG on cp3032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:10] PROBLEM - DPKG on cp3030 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:10] PROBLEM - DPKG on cp3042 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:10] PROBLEM - DPKG on cp1046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:11] PROBLEM - DPKG on cp3037 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:22] PROBLEM - DPKG on cp2019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:22] PROBLEM - DPKG on cp2024 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:23] PROBLEM - DPKG on cp2008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:23] PROBLEM - DPKG on cp2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:24] PROBLEM - DPKG on cp1067 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:24] PROBLEM - DPKG on cp1071 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:29] PROBLEM - DPKG on cp2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:30] PROBLEM - DPKG on cp1073 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:30] PROBLEM - DPKG on cp1053 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:30] PROBLEM - DPKG on cp1043 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:31] PROBLEM - DPKG on cp2013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:31] 
PROBLEM - DPKG on cp2007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:33] the interaction of our dpkg check, puppet runs, and actual apt commands is horrid. nothing is wrong :P [17:38:39] PROBLEM - DPKG on cp1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:39] PROBLEM - DPKG on cp1069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:40] PROBLEM - DPKG on cp2012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:40] PROBLEM - DPKG on cp1060 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:40] PROBLEM - DPKG on cp1063 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:40] PROBLEM - DPKG on cp1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:49] PROBLEM - DPKG on cp1045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:49] PROBLEM - DPKG on cp1059 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:50] PROBLEM - DPKG on cp4011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:50] PROBLEM - DPKG on cp4004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:50] PROBLEM - DPKG on cp4013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:51] PROBLEM - DPKG on cp4007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:51] PROBLEM - DPKG on cp2015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:51] PROBLEM - DPKG on cp1066 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:51] ori: I'm still observing the issue caused by this though [17:38:59] PROBLEM - DPKG on cp1044 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:38:59] PROBLEM - DPKG on cp1049 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:00] PROBLEM - DPKG on cp1055 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:00] PROBLEM - DPKG on cp1058 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:00] PROBLEM - DPKG on cp1048 is CRITICAL: DPKG CRITICAL dpkg reports 
broken packages [17:39:00] PROBLEM - DPKG on cp1074 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:00] PROBLEM - DPKG on cp3039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:01] PROBLEM - DPKG on cp1050 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:01] PROBLEM - DPKG on cp2018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:02] PROBLEM - DPKG on cp3003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:02] PROBLEM - DPKG on cp3038 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:03] PROBLEM - DPKG on cp3049 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:03] PROBLEM - DPKG on cp3041 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:04] PROBLEM - DPKG on cp3040 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:06] Nikerabbit: does it work on mw1017? [17:39:20] PROBLEM - DPKG on cp3022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:20] PROBLEM - DPKG on cp3034 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:20] PROBLEM - DPKG on cp3036 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:20] PROBLEM - DPKG on cp3047 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:20] PROBLEM - DPKG on cp3031 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:21] PROBLEM - DPKG on cp3021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:21] PROBLEM - DPKG on cp3010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:22] PROBLEM - DPKG on cp1056 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:22] PROBLEM - DPKG on cp1052 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:23] PROBLEM - DPKG on cp3008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:23] PROBLEM - DPKG on cp3013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:24] PROBLEM - DPKG on cp1099 is CRITICAL: DPKG CRITICAL dpkg reports broken packages 
[17:39:29] PROBLEM - DPKG on cp2011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:30] PROBLEM - DPKG on cp2017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:30] PROBLEM - DPKG on cp2016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:31] PROBLEM - DPKG on cp2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:39] PROBLEM - DPKG on cp2009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:40] PROBLEM - DPKG on cp2014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:39:40] PROBLEM - DPKG on cp2010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:40:23] ori: seems so (at least in mwscript eval.php there, if that's what you meant) [17:41:50] RECOVERY - DPKG on cp3033 is OK: All packages OK [17:42:00] RECOVERY - DPKG on cp1047 is OK: All packages OK [17:42:00] RECOVERY - DPKG on cp1051 is OK: All packages OK [17:42:00] RECOVERY - DPKG on cp2001 is OK: All packages OK [17:42:00] RECOVERY - DPKG on cp2005 is OK: All packages OK [17:42:00] RECOVERY - DPKG on cp2003 is OK: All packages OK [17:42:00] RECOVERY - DPKG on cp1068 is OK: All packages OK [17:42:00] RECOVERY - DPKG on cp2019 is OK: All packages OK [17:42:01] RECOVERY - DPKG on cp2024 is OK: All packages OK [17:42:01] RECOVERY - DPKG on cp2008 is OK: All packages OK [17:42:02] RECOVERY - DPKG on cp2004 is OK: All packages OK [17:42:02] RECOVERY - DPKG on cp1067 is OK: All packages OK [17:42:03] RECOVERY - DPKG on cp1071 is OK: All packages OK [17:42:09] RECOVERY - DPKG on cp2002 is OK: All packages OK [17:42:10] RECOVERY - DPKG on cp1073 is OK: All packages OK [17:42:10] RECOVERY - DPKG on cp1053 is OK: All packages OK [17:42:10] RECOVERY - DPKG on cp1043 is OK: All packages OK [17:42:10] RECOVERY - DPKG on cp2013 is OK: All packages OK [17:42:11] RECOVERY - DPKG on cp2007 is OK: All packages OK [17:42:19] RECOVERY - DPKG on cp1072 is OK: All packages OK [17:42:20] RECOVERY - DPKG on cp1069 is OK: All packages OK 
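The alert flood above comes from a package-state check racing an apt run, as bblack notes ("the interaction of our dpkg check, puppet runs, and actual apt commands is horrid. nothing is wrong"). A minimal sketch of what such a check does follows; this is not the actual Icinga plugin, and the sample `dpkg -l`-style input is fabricated:

```shell
# Sketch: flag any package whose dpkg state (first column of `dpkg -l`)
# is not the clean installed state "ii". Mid-upgrade, packages pass
# through transient states (e.g. half-configured), so a check that runs
# while puppet/apt is working briefly reports CRITICAL, then recovers.
sample='ii  curl     7.38.0  amd64  command line URL tool
iF  varnish  3.0.6   amd64  half-configured during an apt run
ii  tshark   1.12.1  amd64  network traffic analyzer'

# Print the package name for any line whose state is not "ii".
broken=$(printf '%s\n' "$sample" | awk '$1 != "ii" { print $2 }')

if [ -n "$broken" ]; then
    echo "DPKG CRITICAL dpkg reports broken packages: $broken"
else
    echo "All packages OK"
fi
```

On the fabricated input this reports varnish as broken; rerun after the upgrade finishes (all states back to "ii") and the same logic prints "All packages OK", which is exactly the PROBLEM/RECOVERY pattern in the flood.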
[17:42:20] RECOVERY - DPKG on cp1060 is OK: All packages OK [17:42:20] RECOVERY - DPKG on cp2012 is OK: All packages OK [17:42:20] RECOVERY - DPKG on cp1063 is OK: All packages OK [17:42:20] RECOVERY - DPKG on cp1070 is OK: All packages OK [17:42:29] RECOVERY - DPKG on cp1045 is OK: All packages OK [17:42:29] RECOVERY - DPKG on cp1059 is OK: All packages OK [17:42:30] RECOVERY - DPKG on cp4011 is OK: All packages OK [17:42:30] RECOVERY - DPKG on cp4004 is OK: All packages OK [17:42:31] RECOVERY - DPKG on cp4007 is OK: All packages OK [17:42:31] RECOVERY - DPKG on cp4013 is OK: All packages OK [17:42:31] RECOVERY - DPKG on cp2015 is OK: All packages OK [17:42:31] RECOVERY - DPKG on cp1066 is OK: All packages OK [17:42:39] RECOVERY - DPKG on cp1044 is OK: All packages OK [17:42:39] RECOVERY - DPKG on cp1049 is OK: All packages OK [17:42:40] RECOVERY - DPKG on cp1055 is OK: All packages OK [17:42:40] RECOVERY - DPKG on cp1058 is OK: All packages OK [17:42:40] RECOVERY - DPKG on cp1048 is OK: All packages OK [17:42:40] RECOVERY - DPKG on cp1074 is OK: All packages OK [17:42:40] RECOVERY - DPKG on cp1050 is OK: All packages OK [17:42:41] RECOVERY - DPKG on cp3039 is OK: All packages OK [17:42:41] RECOVERY - DPKG on cp2018 is OK: All packages OK [17:42:42] RECOVERY - DPKG on cp3003 is OK: All packages OK [17:42:42] RECOVERY - DPKG on cp3038 is OK: All packages OK [17:42:43] RECOVERY - DPKG on cp3049 is OK: All packages OK [17:42:43] RECOVERY - DPKG on cp3040 is OK: All packages OK [17:42:44] RECOVERY - DPKG on cp3041 is OK: All packages OK [17:42:59] RECOVERY - DPKG on cp3022 is OK: All packages OK [17:43:00] RECOVERY - DPKG on cp3034 is OK: All packages OK [17:43:00] RECOVERY - DPKG on cp3036 is OK: All packages OK [17:43:00] RECOVERY - DPKG on cp3031 is OK: All packages OK [17:43:00] RECOVERY - DPKG on cp3021 is OK: All packages OK [17:43:00] RECOVERY - DPKG on cp3010 is OK: All packages OK [17:43:00] RECOVERY - DPKG on cp1056 is OK: All packages OK [17:43:01] RECOVERY 
- DPKG on cp1052 is OK: All packages OK [17:43:01] RECOVERY - DPKG on cp3008 is OK: All packages OK [17:43:02] RECOVERY - DPKG on cp3013 is OK: All packages OK [17:43:09] RECOVERY - DPKG on cp2011 is OK: All packages OK [17:43:09] RECOVERY - DPKG on cp2017 is OK: All packages OK [17:43:10] RECOVERY - DPKG on cp2006 is OK: All packages OK [17:43:29] RECOVERY - DPKG on cp2020 is OK: All packages OK [17:43:30] RECOVERY - DPKG on cp1054 is OK: All packages OK [17:43:30] RECOVERY - DPKG on cp4002 is OK: All packages OK [17:43:30] RECOVERY - DPKG on cp4017 is OK: All packages OK [17:43:30] RECOVERY - DPKG on cp4003 is OK: All packages OK [17:43:30] RECOVERY - DPKG on cp4016 is OK: All packages OK [17:43:31] RECOVERY - DPKG on cp2022 is OK: All packages OK [17:43:31] RECOVERY - DPKG on cp4019 is OK: All packages OK [17:43:32] RECOVERY - DPKG on cp4018 is OK: All packages OK [17:43:32] RECOVERY - DPKG on cp4006 is OK: All packages OK [17:43:33] RECOVERY - DPKG on cp4014 is OK: All packages OK [17:43:33] RECOVERY - DPKG on cp4015 is OK: All packages OK [17:43:34] RECOVERY - DPKG on cp4009 is OK: All packages OK [17:43:34] RECOVERY - DPKG on cp2023 is OK: All packages OK [17:43:39] RECOVERY - DPKG on cp2021 is OK: All packages OK [17:43:40] RECOVERY - DPKG on cp1046 is OK: All packages OK [17:43:40] RECOVERY - DPKG on cp3035 is OK: All packages OK [17:43:40] RECOVERY - DPKG on cp3032 is OK: All packages OK [17:43:40] RECOVERY - DPKG on cp3030 is OK: All packages OK [17:43:40] RECOVERY - DPKG on cp3042 is OK: All packages OK [17:44:49] RECOVERY - DPKG on cp3047 is OK: All packages OK [17:44:50] RECOVERY - DPKG on cp1099 is OK: All packages OK [17:45:00] RECOVERY - DPKG on cp2016 is OK: All packages OK [17:45:09] RECOVERY - DPKG on cp2009 is OK: All packages OK [17:45:09] RECOVERY - DPKG on cp2014 is OK: All packages OK [17:45:10] RECOVERY - DPKG on cp2010 is OK: All packages OK [17:45:10] RECOVERY - DPKG on cp2025 is OK: All packages OK [17:45:27] 6operations, 7Monitoring: 
Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1765886 (10RobH) [17:47:53] !log aaron@tin Synchronized wmf-config/PrivateSettings.php: (no message) (duration: 00m 18s) [17:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:51:17] Nikerabbit: fatalmonitor looking quieter now [17:51:59] AaronSchulz: Thanks! [17:52:06] AaronSchulz: what was the issue? [17:52:10] AaronSchulz: yes now it works! thanks [17:52:29] https://logstash.wikimedia.org/#dashboard/temp/AVC0uqwRptxhN1XaTLJe [17:53:01] !log Touched/synced PrivateSettings.php symlink via touch -h [17:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:29] ori, kart_ : could be hhvm caching based on mtime but not following links [17:53:32] (03PS2) 10Chad: ContentTranslation: Use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248938 [17:53:35] so maybe it thought the file was the same [17:53:47] worth investigating more [17:53:49] ah [17:57:26] AaronSchulz: sounds plausible, I have had similar issues with hhvm trying to read non-existing files when I deploy by updating a symlink [17:59:41] RECOVERY - configured eth on lvs1009 is OK: OK - interfaces up [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151029T1800). [18:02:49] RECOVERY - configured eth on lvs1011 is OK: OK - interfaces up [18:03:43] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. 
(There's a steady flow of CVE IDs for Wireshark, but they're essentially all limited to crashes in obscure dissectors, e" [puppet] - 10https://gerrit.wikimedia.org/r/249765 (owner: 10BBlack) [18:03:50] RECOVERY - configured eth on lvs1012 is OK: OK - interfaces up [18:04:30] RECOVERY - configured eth on lvs1008 is OK: OK - interfaces up [18:08:48] (03PS1) 10Alex Monk: beta: Use SSL to connect to restbase where necessary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249794 [18:09:06] (03CR) 10Legoktm: [C: 031] ContentTranslation: Use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248938 (owner: 10Chad) [18:09:54] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1765959 (10ori) 5Open>3Resolved >>! In T115711#1762753, @jcrespo wrote: > We have to drop the views on labs, too. Done. [18:10:49] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, for reference" [puppet] - 10https://gerrit.wikimedia.org/r/249765 (owner: 10BBlack) [18:11:39] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765962 (10GWicke) @ottomata: If you fill in the defaults at consumption time, then you have a choice of how you want to treat old events. You can ei... [18:12:29] (03CR) 10BBlack: [C: 032] standard packages: add tshark (non-gui wireshark) [puppet] - 10https://gerrit.wikimedia.org/r/249765 (owner: 10BBlack) [18:13:29] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765969 (10Ottomata) Events will be consumed into Hadoop close to production time (within an hour usually). Schema changes made years after the fact... 
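AaronSchulz's symlink-mtime hypothesis above (and the `touch -h` workaround from the log) is easy to demonstrate with plain coreutils: `touch` follows a symlink and updates the target, while `touch -h` updates the link itself, so a cache keyed on one mtime can miss a change to the other. The file names echo the log but everything happens in a temp directory; GNU `touch`/`stat` are assumed:

```shell
# Demonstrate that a symlink and its target carry independent mtimes.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

echo 'v1' > "$dir/PrivateSettings-real.php"
ln -s "$dir/PrivateSettings-real.php" "$dir/PrivateSettings.php"

# Backdate only the symlink's own mtime (-h = don't follow the link);
# the target keeps its current timestamp.
touch -h -d '2001-01-01 00:00:00' "$dir/PrivateSettings.php"

# GNU stat without -L reports the link itself, not the target.
link_mtime=$(stat -c %Y "$dir/PrivateSettings.php")
target_mtime=$(stat -c %Y "$dir/PrivateSettings-real.php")

echo "link mtime: $link_mtime, target mtime: $target_mtime"
```

If a stat cache keys invalidation on the target's mtime, updating only the link (or vice versa) looks like "nothing changed", which matches the behaviour Nikerabbit and AaronSchulz describe.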
[18:13:51] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1765970 (10RobH) a:5jcrespo>3RobH Claiming this task, as I've started the process of obtaining quotes on the associated blocking tasks. [18:13:57] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1765972 (10RobH) 5Open>3stalled [18:18:19] (03PS1) 10Ori.livneh: move error pages to errorpages/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249796 [18:19:26] (03PS3) 10Dzahn: contint: restore unattended upgrade on slaves [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [18:19:49] (03CR) 10Dzahn: [C: 032] contint: restore unattended upgrade on slaves [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [18:20:06] (03PS2) 10Ori.livneh: move error pages to errorpages/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249796 [18:20:58] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1765992 (10Cmjohnson) Connected all the lvs's to asw-d-eqiad. @bblack confirmed he sees them all. Leaving task open to monitor. [18:26:13] legoktm, any idea why https://gerrit.wikimedia.org/r/249794 doesn't work? [18:27:01] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [100000000.0] [18:27:23] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1083's sda disk is dying - https://phabricator.wikimedia.org/T116184#1766025 (10Cmjohnson) The physical disk has been replaced....requires fresh install [18:32:17] Krenair: what about isset( $_SERVER['HTTPS'] ) && $_SERVER['HTTPS'] === 'on' ? [18:33:08] is that a header set by varnish? 
[18:33:29] hrm [18:33:30] if ( isset( $_SERVER['HTTP_X_FORWARDED_PROTO'] ) && $_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https' ) { [18:33:30] $wgCookieSecure = true; [18:33:30] $_SERVER['HTTPS'] = 'on'; // Fake this so MW goes into HTTPS mode [18:33:32] } [18:33:44] (in CommonSettings.php) [18:33:46] I'm not sure. [18:33:49] XFP is set by our edge stuff, yes [18:34:00] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:34:07] XFP is our standard indicator from the edge->app stuff that the request did come in over HTTPS [18:34:35] Krenair: what part of it isn't working? [18:34:50] it just doesn't seem to take effect at all [18:34:58] the $_SERVER['HTTPS'] part is PHP [18:35:13] " Set to a non-empty value if the script was queried through the HTTPS protocol. " [18:35:41] IMHO in https://gerrit.wikimedia.org/r/#/c/249794/1/wmf-config/CommonSettings.php there shouldn't be any detection, just a wmf-config setting to use HTTPS for restbase or whatever. [18:35:45] Krenair: beta doesn't have HTTPS I thought? [18:35:49] Yes it does? [18:36:01] I put a self-signed cert in over a week ago [18:36:01] because for all of prod $wgServerName/api/rest_v1 should just always be https:// [18:36:14] the only reason to differentiate it at all is the beta-cluster stuff might still need non-https [18:36:28] and no, beta doesn't have correctly working https in every sense [18:37:07] Krenair: oh. if restbase in beta supports https, why don't all connections use it? [18:37:20] it doesn't I don't think, at least not in a way that's going to work consistently [18:37:32] (03PS3) 10Dzahn: partman: remove unused mailserver recipe [puppet] - 10https://gerrit.wikimedia.org/r/249677 (owner: 10John F.
Lewis) [18:37:38] nothing is going to trust the self-self, and also that whole double-star construction doesn't legally work for the domainnames [18:37:40] RECOVERY - DPKG on labmon1001 is OK: All packages OK [18:37:46] s/self-self/self-sign/ [18:37:59] I know the cert is not going to be trusted [18:38:08] I can't fix that [18:38:28] I'm just saying, it was broken before and it's still broken now [18:38:35] most http clients or libraries will still fail on it [18:38:39] *https [18:39:10] It's less broken now [18:39:18] (03CR) 10Dzahn: [C: 032] "true, not used by anything. mx servers and fermium all use virtual.cfg because they are VMs" [puppet] - 10https://gerrit.wikimedia.org/r/249677 (owner: 10John F. Lewis) [18:39:23] Krenair: I'm not really sure. add some hacky logging to see what $_SERVER is in beta? [18:39:29] but for now it would be perfectly ok to put a config setting in the config repo for https_for_restbase or whatever, and just set it to always use https in prod and always http in beta for now [18:39:48] it shouldn't matter what the outer connection that triggered the restbase call was doing with XFP before [18:39:51] We already have that [18:40:04] It means you can't use VE over HTTPS to beta [18:40:18] Because it always tries to connect to restbase over HTTP, which will be blocked [18:40:32] :) [18:40:40] what blocks it? most of everything works fine over HTTP in beta [18:40:45] Chrome [18:40:51] why? [18:41:00] is beta using prod's restbase? [18:41:01] Because you can't load non-HTTPS resources in an HTTPS page? [18:41:03] no [18:41:13] Krenair: uh, proto-rel? :P [18:41:16] I guess what I'm saying is, nothing on beta should be using HTTPS [18:41:25] all HTTPS there is broken in one way or another anyways [18:41:46] start with http://en.wikipedia.beta.wmflabs.org/ or whatever, and then why would anything else flip to https at all? [18:41:59] That goes against the point of beta and many tasks open against beta itself. 
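The HTTPS-detection logic pasted from CommonSettings.php earlier in this exchange can be sketched in Python. This is illustrative only: the real code is PHP in wmf-config, and the WSGI-style `environ` dict and the function name here are assumptions, not MediaWiki's.

```python
# Illustrative sketch of the CommonSettings.php logic quoted above: a
# TLS-terminating edge (Varnish, in this discussion) sets X-Forwarded-Proto,
# and the app layer uses it to decide that the request originally came in
# over HTTPS. Names are assumptions, not MediaWiki's actual identifiers.

def detect_https(environ, config):
    """Mark the request as HTTPS if the edge says so via X-Forwarded-Proto."""
    if environ.get('HTTP_X_FORWARDED_PROTO') == 'https':
        config['CookieSecure'] = True   # analogue of $wgCookieSecure = true
        environ['HTTPS'] = 'on'         # fake this so the app goes into HTTPS mode
    return config

# A request forwarded by the edge over HTTPS flips both settings:
env = {'HTTP_X_FORWARDED_PROTO': 'https'}
conf = detect_https(env, {})
assert conf.get('CookieSecure') is True and env['HTTPS'] == 'on'

# A plain-HTTP request leaves everything untouched:
env2 = {}
assert detect_https(env2, {}) == {} and 'HTTPS' not in env2
```

As bblack notes, the app can trust XFP only because clients cannot reach the app servers directly; the header is meaningful solely when set by the edge.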
[18:42:13] that HTTPS is broken in beta is a problem, yes [18:42:28] but the problem isn't fixed, not by far, and so we can't just bludgeon on acting like it is fixed when it's not [18:42:45] I'm not acting like it's fixed. [18:43:01] so, don't use HTTPS on beta [18:43:13] No. [18:44:00] We should fix it properly by getting real certs. [18:44:09] Be my guest, ostriches. [18:44:20] yeah I know, we have multiple open tasks about it, and a couple of pending solutions [18:44:28] but it's not fixed yet, and it won't be this week for sure heh [18:44:36] bblack: I know, I agree with you :) [18:44:48] (03CR) 10Dzahn: "we had pretty much the same discussion for the misc. services, now we say differently just for "mw" itself. i still think it is the nicer " [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [18:45:28] bblack: one issue is that browsers disable certain features (ex: crypto, serviceworkers) when using http, and we are starting to use on those features [18:46:33] so, at least longer term I think would be very desirable to support all-https in beta labs [18:46:59] also, I can't type today [18:51:29] 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1766074 (10Cmjohnson) @chasemp ping when it's a good time to do this. [18:54:33] 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1766086 (10Ottomata) Nice! @Jallemandou, @milimetric ^ [18:54:50] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [18:56:07] (03PS2) 10Dzahn: palladium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247217 (owner: 10Muehlenhoff) [18:56:32] (03CR) 10Dzahn: [C: 032] "yea, puppetmaster itself - but checked in compiler, there was no diff. 
link above" [puppet] - 10https://gerrit.wikimedia.org/r/247217 (owner: 10Muehlenhoff) [18:57:52] Milimetric: thx for the update..i am going to take care of aqs1001 now if thats okay [18:58:12] cmjohnson1: thank you! /me is afraid of fire [18:58:19] (it's ok now yeah) [18:58:28] cool [18:59:25] !log powering down aqs1001 for h/w maintenance [18:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:39] PROBLEM - Host aqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:04:49] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [19:04:59] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [19:14:55] ok I've been derailed several times. now I'm about to start the train deploy for real. [19:15:30] RECOVERY - Host aqs1001 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [19:17:28] (03CR) 10Dzahn: "why this?" [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [19:18:54] 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1766176 (10Cmjohnson) The thermal paste was very crusty and caked on. Cleaned off and reapplied and the temps are much better now. Leaving open and will check back in 24-48 hours. cmjohnso... 
[19:19:00] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [19:19:11] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [19:19:30] (03PS2) 10Alex Monk: beta: Use protocol-relative link to connect to restbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249794 [19:25:11] ok really _really_ deploying the train now [19:25:46] (03PS1) 10Aaron Schulz: Fixed getMWScriptWithArgs() user error message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803 [19:28:17] (03PS1) 1020after4: wikipedia wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249805 [19:29:31] (03CR) 1020after4: [C: 032] wikipedia wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249805 (owner: 1020after4) [19:29:36] (03Merged) 10jenkins-bot: wikipedia wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249805 (owner: 1020after4) [19:31:27] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedia wikis to 1.27.0-wmf.4 [19:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:35] ok wmf.4 caused a significant increase in log errors: [19:37:39] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [19:37:51] some are db errors. should probably roll back? 
[19:38:47] (03PS1) 1020after4: wikipedia wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249810 [19:39:22] (03CR) 1020after4: "rolling back to wmf.3 due to a moderately large increase in log errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249810 (owner: 1020after4) [19:39:31] yes [19:39:40] !log rolling back to 1.27.0-wmf.3 due to increase in log errors [19:39:45] (03CR) 1020after4: [C: 032] wikipedia wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249810 (owner: 1020after4) [19:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:51] (03Merged) 10jenkins-bot: wikipedia wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249810 (owner: 1020after4) [19:40:25] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedia wikis to 1.27.0-wmf.3 [19:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:37] most of the errors were https://phabricator.wikimedia.org/T117084 [19:42:56] Cite extension [19:43:01] (for others playing along at home) [19:44:18] twentyafterfour: not sure of most? https://logstash.wikimedia.org/#dashboard/temp/AVC1IQYPptxhN1XafgCi [19:44:28] 4 in that view [19:49:04] (03PS1) 10Ori.livneh: Enable ScribuntoGatherFunctionStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249811 [19:49:16] (03CR) 10Ori.livneh: [C: 032] Enable ScribuntoGatherFunctionStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249811 (owner: 10Ori.livneh) [19:49:22] (03Merged) 10jenkins-bot: Enable ScribuntoGatherFunctionStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249811 (owner: 10Ori.livneh) [19:50:21] twentyafterfour: re. T117084, I guess it happens because "!$this->mRefs[$group]" is used instead of "!isset( $this->mRefs[$group] )"? 
[19:50:33] !log ori@tin Synchronized wmf-config/CommonSettings.php: I06879b6e6e: Enable ScribuntoGatherFunctionStats (duration: 00m 17s) [19:50:34] greg-g: see also https://phabricator.wikimedia.org/T117089 [19:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:50:40] (03PS3) 10Dzahn: etc,redis,dynamicproxy: fix some lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/249658 [19:50:57] ^ that second task describes what looks like a more serious problem [19:50:58] (03CR) 10Dzahn: [C: 032] etc,redis,dynamicproxy: fix some lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/249658 (owner: 10Dzahn) [19:51:03] the first one was just log spam [19:53:42] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1766328 (10yuvipanda) Bringing this back up, for T117081 [19:54:11] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1766331 (10yuvipanda) [19:55:18] ori: is that scribuntogather change you just merged related to the errors during deploy? [19:58:09] (03PS1) 10Milimetric: Enable the limn-ee-data report runner [puppet] - 10https://gerrit.wikimedia.org/r/249812 [19:58:20] greg-g: which error?
[19:58:37] ori: I guess not then: https://phabricator.wikimedia.org/T117089 [19:58:38] i'm not aware of it causing or relating to errors [19:58:46] yes, definitely not related to that [19:58:49] I'm just regexing "gather" right now :) [19:58:51] (03CR) 10Krinkle: webperf::navtiming additionaly store wikidata seperately (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) (owner: 10JanZerebecki) [19:58:58] yeah, no relation to Extension:Gather [19:59:10] kk, thanks [19:59:21] but dangit for not fixing the issue man :P [20:01:18] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1 temporarily while discussing this https://gerrit.wikimedia.org/r/#/c/249697/" [puppet] - 10https://gerrit.wikimedia.org/r/249501 (https://phabricator.wikimedia.org/T112914) (owner: 10Yurik) [20:01:56] (03CR) 10Alexandros Kosiaris: [C: 032] openldap: Use require_package [puppet] - 10https://gerrit.wikimedia.org/r/249720 (owner: 10Muehlenhoff) [20:09:03] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1766404 (10BBlack) So far I haven't seen any fallout, but I'm not sure if there's any positive check to ensure whatever was going wrong with snmp isn't going wrong anymore. Hopefully it was just related... 
[20:09:37] (03CR) 10Alexandros Kosiaris: [C: 032] openldap: Abide $datadir [puppet] - 10https://gerrit.wikimedia.org/r/249718 (owner: 10Muehlenhoff) [20:09:45] (03PS2) 10Alexandros Kosiaris: openldap: Abide $datadir [puppet] - 10https://gerrit.wikimedia.org/r/249718 (owner: 10Muehlenhoff) [20:10:09] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [20:10:19] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [20:11:15] ^mysql db1008/lutetium is expected, acking icinga [20:11:29] (03CR) 10Alexandros Kosiaris: [V: 032] openldap: Abide $datadir [puppet] - 10https://gerrit.wikimedia.org/r/249718 (owner: 10Muehlenhoff) [20:11:55] (03PS2) 10Alexandros Kosiaris: openldap: Use require_package [puppet] - 10https://gerrit.wikimedia.org/r/249720 (owner: 10Muehlenhoff) [20:12:06] (03CR) 10Alexandros Kosiaris: [V: 032] openldap: Use require_package [puppet] - 10https://gerrit.wikimedia.org/r/249720 (owner: 10Muehlenhoff) [20:12:32] ACKNOWLEDGEMENT - check_mysql on db1008 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) Jeff_Green civi upgrade [20:13:11] ACKNOWLEDGEMENT - check_mysql on fdb2001 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) Jeff_Green civi upgrade [20:14:46] ACKNOWLEDGEMENT - check_mysql on fdb2001 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) Jeff_Green civi upgrade [20:14:52] (03CR) 10Yurik: [C: 04-1] "TileratorUI has significantly different way of running - it is a single instance per IP:port, which means that if i change source configur" [puppet] - 10https://gerrit.wikimedia.org/r/249697 (https://phabricator.wikimedia.org/T116062) (owner: 10Alexandros Kosiaris) [20:15:19] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [20:16:50] (03PS1) 10Dzahn: openldap: fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/249817 
[20:17:47] (03CR) 10Dzahn: [C: 032] openldap: fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/249817 (owner: 10Dzahn) [20:18:43] (03CR) 10Yurik: "Lastly, for extra security, TileratorUI database accounts could have additional rights like creating tables and keyspaces - when sources a" [puppet] - 10https://gerrit.wikimedia.org/r/249697 (https://phabricator.wikimedia.org/T116062) (owner: 10Alexandros Kosiaris) [20:19:40] (03PS1) 10Dzahn: admin: add jzerebecki to deployers [puppet] - 10https://gerrit.wikimedia.org/r/249818 (https://phabricator.wikimedia.org/T116487) [20:20:19] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [20:20:50] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1766457 (10yuvipanda) 3NEW [20:22:09] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1766475 (10yuvipanda) [20:22:11] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1766476 (10yuvipanda) [20:22:57] (03CR) 10Muehlenhoff: "It's caused by a Gerrit merge conflict. 
It fails to rebase automatically and leaves these ">>>>" marks in the patch (which you're usually " [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [20:25:14] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1766481 (10chasemp) p:5Triage>3Normal a:3chasemp [20:25:19] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [20:25:58] (03PS2) 10Dzahn: Move base:firewall include into the memcached role [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [20:26:32] (03PS3) 10Dzahn: Move base:firewall include into the memcached role [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [20:27:42] (03PS4) 10Dzahn: Move base:firewall include into the memcached role [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [20:30:19] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [20:30:49] ACKNOWLEDGEMENT - check_mysql on lutetium is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) Jeff_Green civi upgrade [20:31:28] (03PS5) 10Dzahn: Move base:firewall include into the memcached role [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [20:33:27] 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117, 3labs-sprint-118: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1766502 (10RobH) [20:33:37] (03CR) 10Dzahn: [C: 031] "fixed. works now. 
and compiled: http://puppet-compiler.wmflabs.org/1124/" [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [20:33:42] (03PS6) 10Dzahn: Move base:firewall include into the memcached role [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [20:35:50] (03CR) 10Dzahn: [C: 032] Move base:firewall include into the memcached role [puppet] - 10https://gerrit.wikimedia.org/r/245964 (owner: 10Muehlenhoff) [20:37:23] (03PS1) 10Alexandros Kosiaris: exim: Add and use $::other_site to provide LDAP fallback [puppet] - 10https://gerrit.wikimedia.org/r/249868 (https://phabricator.wikimedia.org/T82662) [20:39:42] (03PS2) 10Dzahn: Use testsystem role for einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/247237 (owner: 10Muehlenhoff) [20:41:44] twentyafterfour: it could simply be contention for the message cache table [20:41:49] was it diminishing over time? [20:41:57] or was the branch not deployed for long enough for it to stick? [20:42:24] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766532 (10RobH) 3NEW a:3RobH [20:42:40] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1766541 (10RobH) [20:42:41] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766540 (10RobH) [20:42:51] greg-g: ^ [20:43:51] (03CR) 10Dzahn: "was about to merge, except: Host einsteinium.eqiad.wmnet not found: 3(NXDOMAIN) ...uh? 
why is it in DNS and site.pp but i still can't res" [puppet] - 10https://gerrit.wikimedia.org/r/247237 (owner: 10Muehlenhoff) [20:43:59] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766532 (10RobH) [20:45:55] greg-g, twentyafterfour: I feel pretty safe in rolling that back, so I'll give reverting https://gerrit.wikimedia.org/r/#/c/249810/ a shot [20:46:02] (03PS1) 10Ori.livneh: Revert "wikipedia wikis to 1.27.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249871 [20:46:22] (03CR) 10Ori.livneh: [C: 032] Revert "wikipedia wikis to 1.27.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249871 (owner: 10Ori.livneh) [20:46:28] (03Merged) 10jenkins-bot: Revert "wikipedia wikis to 1.27.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249871 (owner: 10Ori.livneh) [20:47:31] (03Restored) 10Alexandros Kosiaris: (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [20:47:31] !log ori@tin rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [20:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:20] (03Abandoned) 10Dzahn: Use testsystem role for einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/247237 (owner: 10Muehlenhoff) [20:50:21] (03CR) 10Alexandros Kosiaris: "Yes, better, but the dependent patch https://gerrit.wikimedia.org/r/#/c/249059/1 should be resolved and merged first. 
I 'll rebase and cha" [puppet] - 10https://gerrit.wikimedia.org/r/249056 (owner: 10Dzahn) [20:50:45] (03PS2) 10Dzahn: zim: Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/245874 (owner: 10Muehlenhoff) [20:50:51] (03CR) 10Dzahn: [C: 032] zim: Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/245874 (owner: 10Muehlenhoff) [20:51:54] (03CR) 10Alexandros Kosiaris: [C: 031] Change all mysql servers to max_allowed_package = 32MB [puppet] - 10https://gerrit.wikimedia.org/r/249135 (owner: 10Jcrespo) [20:52:06] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766623 (10RobH) [20:52:18] (03CR) 10Dzahn: [C: 031] "+1, +Yuvi" [puppet] - 10https://gerrit.wikimedia.org/r/246295 (owner: 10Muehlenhoff) [20:53:46] (03PS2) 10Dzahn: wdqs: Remove superflous include [puppet] - 10https://gerrit.wikimedia.org/r/246307 (owner: 10Muehlenhoff) [20:54:24] (03CR) 10Dzahn: [C: 032] "confirmed. already included." [puppet] - 10https://gerrit.wikimedia.org/r/246307 (owner: 10Muehlenhoff) [20:55:55] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766532 (10RobH) [20:55:57] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1766662 (10RobH) 5Open>3Resolved a:3RobH Ok, so everything in the blocking network tasks states row B is labs in codfw as well (for now.) So I'm resolv... [20:56:23] !log restbase: switch local_group_wikiquote_T_title__revisions to Date-Tiered compaction (DTCS) [20:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:13] (03PS2) 10Dzahn: etherpad: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/244706 (owner: 10Muehlenhoff) [20:57:19] wait what? 
[20:57:51] (03CR) 10Yuvipanda: [C: 031] "I guess the ultimate 'right' thing to do is to have a role specifically for flannel and move it there, but until then..." [puppet] - 10https://gerrit.wikimedia.org/r/246295 (owner: 10Muehlenhoff) [20:57:54] twentyafterfour: I moved us back to wmf4, because my understanding of the db errors was normal contention around the ResourceLoader tables around deploy time [20:58:00] twentyafterfour: I'm watching the logs, they should go down [20:58:06] (03PS3) 10Dzahn: etherpad: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/244706 (owner: 10Muehlenhoff) [20:58:20] (03CR) 10Dzahn: [C: 032] etherpad: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/244706 (owner: 10Muehlenhoff) [20:58:40] cf T117089 [21:00:09] (03CR) 10Dzahn: [C: 031] Include base::firewall in the phabricator role [puppet] - 10https://gerrit.wikimedia.org/r/247260 (owner: 10Muehlenhoff) [21:00:35] ori: ok what about T117084 [21:00:38] (03CR) 10Dzahn: [C: 031] labnet1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247205 (owner: 10Muehlenhoff) [21:01:03] (03CR) 10Dzahn: [C: 031] labnodepool1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247207 (owner: 10Muehlenhoff) [21:02:25] twentyafterfour: the empty() can be restored [21:02:28] not sure why that was changed [21:05:32] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766722 (10RobH) [21:05:40] 6operations, 10ops-codfw, 6Labs, 10Labs-Infrastructure: on-site tasks for labs deployment cluster - https://phabricator.wikimedia.org/T117107#1766723 (10RobH) 3NEW a:3Papaul [21:06:14] !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/Cite/Cite_body.php: Revert "Avoid counting arrays if not needed" (duration: 00m 17s) [21:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:29] 
6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766532 (10RobH) [21:09:40] AaronSchulz: because Krinkle (See gerrit comments) [21:10:17] https://gerrit.wikimedia.org/r/#/c/223250/ [21:12:09] yes, I saw that patch :) [21:15:03] mutante: thanks for clearing the patch queue :-) [21:15:37] moritzm: welcome:) [21:20:57] twentyafterfour: still working on the deploy? [21:21:23] tgr: I guess ori deployed it? [21:22:06] https://en.wikipedia.org/wiki/Special:Version says it's wmf.4 so yeah [21:22:17] can I grab tin for a short time then? I want to deploy https://gerrit.wikimedia.org/r/#/c/249873/ [21:23:00] and https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor is still going crazy with errors [21:23:17] tgr go for it [21:24:22] AaronSchulz: So the only problem line is empty( $this->mRefs[$group] ), right? [21:24:25] Or rather, that line needs to be restored [21:24:34] afaik [21:24:35] so the problem is that the index may not exist? [21:24:42] the others don't change anything really [21:24:47] Krinkle: looks like it [21:24:47] (code integrity problem) [21:24:56] OK. That seems fair. [21:25:03] It was all but obvious from the code that the reason it used empty is that [21:25:15] This is exactly why empty() is the worst invention since mayonnaise. [21:25:21] It tells you nothing. [21:25:37] It does everything and you can't change it because maybe the one thing you didn't think it did is the reason it's there. [21:25:40] I'll restore with isset() [21:26:12] In fact, other code in the Cite extension already uses isset() for that [21:26:15] unrelated to that patch [21:26:20] should've seen that, sorry [21:29:32] :) [21:29:55] twentyafterfour: This didn't cause fatal though right, just notice spike?
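Krinkle's complaint about `empty()` is that it conflates a missing key with a present-but-falsy value, so swapping it for a plain truthiness test (or for `isset()`) silently changes behavior. A rough Python analogue of the distinction, with made-up helper names:

```python
# Illustrative sketch of the empty()-vs-isset() pitfall discussed above,
# translated to Python dicts. PHP's empty($a[$k]) is true when the key is
# missing OR the value is falsy ('', 0, [], null); isset($a[$k]) only checks
# that the key is present (and non-null). Replacing one with the other
# changes behavior for falsy values, which is what bit the Cite change here.

def php_empty(d, key):
    # empty(): missing key, or present with a falsy value
    return key not in d or not d[key]

def php_isset(d, key):
    # isset(): key present and not null (mirroring PHP's null handling)
    return key in d and d[key] is not None

refs = {'group1': [], 'group2': ['ref-a']}

# For a present-but-empty group the two checks disagree:
assert php_empty(refs, 'group1') is True
assert php_isset(refs, 'group1') is True

# For a missing group only empty() stays quiet; a direct truthiness test
# on refs['group3'] would be the Python analogue of the undefined-index
# notice that flooded the logs.
assert php_empty(refs, 'group3') is True
assert php_isset(refs, 'group3') is False
```

In the Cite case the notices came from testing `$this->mRefs[$group]` directly when the key might not exist; the fix was to restore a presence check.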
[21:30:22] Krinkle: right just log spam [21:31:02] I do see a lot of other errors too [21:31:03] Fatal error: Call to undefined method RunningStat::push() in /srv/mediawiki/php-1.27.0-wmf.4/includes/libs/Xhprof.php on line 259 [21:31:43] that would be me, and would be fixed by a sync [21:31:55] it's not affecting users happily [21:32:07] ori: ok [21:32:26] (03PS6) 10Alexandros Kosiaris: WIP: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - 10https://gerrit.wikimedia.org/r/249056 (owner: 10Dzahn) [21:32:28] (03PS2) 10Alexandros Kosiaris: (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [21:32:29] !log ori@tin Synchronized php-1.27.0-wmf.4/includes/libs/Xhprof.php: (no message) (duration: 00m 18s) [21:32:33] (03CR) 10jenkins-bot: [V: 04-1] WIP: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - 10https://gerrit.wikimedia.org/r/249056 (owner: 10Dzahn) [21:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:35] (03CR) 10jenkins-bot: [V: 04-1] (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [21:33:59] also: Invalid status code returned from AbortLogin hook: 11 [21:34:20] https://en.wikibooks.org/w/index.php?title=Special:UserLogin&returnto=User%3ASofiaTalbert875 [21:34:21] PROBLEM - puppet last run on mw2072 is CRITICAL: CRITICAL: Puppet has 5 failures [21:36:07] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1766786 (10EBernhardson) 3NEW [21:36:28] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1766793 (10EBernhardson) [21:36:45] 6operations, 6Discovery: Investigate 
adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1766786 (10EBernhardson) [21:36:59] ori: https://gerrit.wikimedia.org/r/249890 [21:38:08] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1766786 (10EBernhardson) [21:40:54] !log tgr@tin Synchronized php-1.27.0-wmf.4/includes/MagicWord.php: T117066 (duration: 00m 18s) [21:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:43:02] (03PS7) 10Alexandros Kosiaris: WIP: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - 10https://gerrit.wikimedia.org/r/249056 (owner: 10Dzahn) [21:43:04] (03PS3) 10Alexandros Kosiaris: (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [21:48:59] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766838 (10atgo) I'm still not able to get in. @jgreen tried to help but it seems we're hitting a wall. Reopening pending this getting fixed. I'm available to d... [21:51:35] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766857 (10Krenair) Paste the output of `ssh -vvv stat1002`? We might be able to see what the problem is. [21:52:21] https://gerrit.wikimedia.org/r/#/c/249673/ could use a cherry pick [21:52:43] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: don't start cassandra at boot or puppet - https://phabricator.wikimedia.org/T103134#1766867 (10akosiaris) Why not rely on `systemctl mask/unmask cassandra` ? We 've tested this approach in the maps cluster where we wanted maps-admins to be capable of disa... 
[21:57:53] ebernhardson: BadMethodCallException from line 390 of /srv/mediawiki/php-1.27.0-wmf.4/includes/specials/SpecialSearch.php: Call to a member function hasInterwikiResults() on a non-object (NULL) {"exception_id":"fc83a42d"} [21:57:54] known? [21:58:10] ori: see above [21:58:20] ah, great. [21:58:24] ori: i have a patch for it to be sent out in swat [21:58:34] I'll backport that, need to do another backport anyway [21:58:54] ebernhardson: might wanna do it sooner [21:59:14] feel free to fill them as tasks against #wikimedia-log-errors [21:59:28] (I meant: fill the backtrace as a task) [22:00:56] 6operations, 10Phabricator-Bot-Requests, 10procurement: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1766910 (10RobH) 3NEW a:3chasemp [22:01:39] RECOVERY - puppet last run on mw2072 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [22:01:57] 6operations, 10hardware-requests: Site: 1 server hardware access request for initializing the codfw elasticsearch cluster. - https://phabricator.wikimedia.org/T116236#1766919 (10RobH) 5stalled>3declined a:3RobH Update from IRC: @ebernhardson: robh: can close that ticket, we ended up adding mediawiki to... [22:02:03] !log restbase: switched local_group_default_T_parsoid_html to Date-Tiered compaction (DTCS) [22:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:42] it's wikidata's birthday. maybe can close out https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+message:wikidata+status:open,n,z ? 
[22:09:29] (03CR) 10Dzahn: [C: 031] ""The output of this version is fully compatible with bzip2" [puppet] - 10https://gerrit.wikimedia.org/r/249729 (owner: 10Hoo man) [22:09:36] (03PS2) 10Dzahn: Use pbzip2 -p3 to compress Wikidata JSON dumps on snapshot1003 [puppet] - 10https://gerrit.wikimedia.org/r/249729 (owner: 10Hoo man) [22:09:58] (03PS5) 10MaxSem: Beta: use final agreed upon deployment scheme [puppet] - 10https://gerrit.wikimedia.org/r/248374 [22:09:59] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1766950 (10faidon) Looks fine so far; last time the effect was immediate so I think we'll be okay. What exactly was different than last time? It'd be nice to know what went wrong to avoid something like... [22:10:32] (03CR) 10Dzahn: [C: 032] Use pbzip2 -p3 to compress Wikidata JSON dumps on snapshot1003 [puppet] - 10https://gerrit.wikimedia.org/r/249729 (owner: 10Hoo man) [22:11:48] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766954 (10atgo) OpenSSH_6.9p1, LibreSSL 2.1.7 debug1: Reading configuration data /Users/agomez/.ssh/config debug1: /Users/agomez/.ssh/config line 23: Applying o... [22:13:03] (03PS1) 10MaxSem: Fix exceptionmonitor [puppet] - 10https://gerrit.wikimedia.org/r/249905 [22:13:42] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766965 (10Dzahn) Hi, so the key part here seems this: debug1: identity file /Users/agomez/.ssh/analytics_rsa type 1 debug1: key_load_public: No such file or d... 
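The pbzip2 change being merged above trades CPU for wall-clock time on the dump host. A sketch of the usage (the filename is a placeholder, not from the log):

```shell
# pbzip2 compresses the input in independent blocks across several
# threads; -p3 caps it at three workers so other jobs on the snapshot
# host are not starved.
pbzip2 -p3 wikidata-20151029-all.json

# As the review comment above notes, the output is ordinary bzip2 data,
# so stock bunzip2 can verify and decompress it:
bunzip2 -t wikidata-20151029-all.json.bz2
```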
[22:15:26] (03CR) 10Dzahn: "snapshot1001: Notice: /Stage[main]/Packages::Pbzip2/Package[pbzip2]/ensure: ensure changed 'purged' to 'present'" [puppet] - 10https://gerrit.wikimedia.org/r/249729 (owner: 10Hoo man) [22:15:28] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766971 (10Krenair) I don't think you copied @dzahn's config, did you? Since your local username (`agomez`) is different to your remote username (`atgomez`), you... [22:15:37] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1766975 (10Eevans) [22:15:42] (03PS1) 10Rush: codfw labs row a reservations overlap [dns] - 10https://gerrit.wikimedia.org/r/249906 [22:16:38] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766979 (10atgo) If I do "cat /Users/agomez/.ssh/analytics_rsa" I get the RSA private key, so seems like it. [22:17:38] (03CR) 10Rush: [C: 032] codfw labs row a reservations overlap [dns] - 10https://gerrit.wikimedia.org/r/249906 (owner: 10Rush) [22:18:52] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766990 (10atgo) Here's the full text of my config, crafted by @jgreen {F2893684} [22:21:06] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766994 (10Jgreen) OH I SEE IT. (I think...) Host bast1001.wikimedia.org User atgomez IdentityFile ~/.ssh/analytics_rsa bad--> ProxyCommand /... 
[22:21:45] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766995 (10Krenair) Try removing the ProxyCommand line in the `bast1001.wikimedia.org` block that attempts to proxy connections to bast1001 through bast1001... [22:23:14] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1766997 (10Dzahn) krenair: What you said seemed right, though on bast1001, i checked logfiles, and i don't see an attempt from "agomez", just one from "atgomez",... [22:23:20] yurik: jhobs: I'll deploy https://gerrit.wikimedia.org/r/#/c/249874/ since its core commit got sandwiched between other core commits that need to be deployed [22:23:39] tgr, go for it [22:24:18] we probably should add a warning to the docs to not +2 extension cherry-picks before SWAT [22:24:48] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1767001 (10Dzahn) >>! In T115666#1766994, @Jgreen wrote: > bad--> ProxyCommand /usr/bin/ssh -q -W %h:%p bast1001.wikimedia.org > oops. ^^^^ Yes, that'... [22:26:50] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1767004 (10atgo) Success! I removed the ProxyCommand line and made it in to bast1001. Everything look OK from your side? [22:28:42] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1767012 (10Dzahn) Yes, it does :) "Starting session: shell on pts/7 for atgomez" Now let's try if you can also jump via bast1001 straight to stat1002/1003. Jus... 
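For reference, the fix Jgreen and Krenair converge on above amounts to an `~/.ssh/config` along these lines: no ProxyCommand in the bastion's own block, and the bastion as the jump host for the stat machines. The internal `.eqiad.wmnet` hostnames are an assumption, not quoted from the log:

```
Host bast1001.wikimedia.org
    User atgomez
    IdentityFile ~/.ssh/analytics_rsa
    # No ProxyCommand here: the bastion must be reached directly,
    # not proxied through itself.

Host stat1002.eqiad.wmnet stat1003.eqiad.wmnet
    User atgomez
    IdentityFile ~/.ssh/analytics_rsa
    ProxyCommand /usr/bin/ssh -q -W %h:%p bast1001.wikimedia.org
```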
[22:28:43] (03CR) 10BryanDavis: [C: 031] Fixed getMWScriptWithArgs() user error message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803 (owner: 10Aaron Schulz) [22:29:09] !log tgr@tin Synchronized php-1.27.0-wmf.4/includes/Category.php: c249890, fixes some warnings in production (duration: 00m 18s) [22:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:04] !log tgr@tin Synchronized php-1.27.0-wmf.4/includes/specials/SpecialSearch.php: c249899, fixes some warnings in production (duration: 00m 17s) [22:30:08] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1767026 (10Cmjohnson) I think the snmp failures were a direct result of using the wrong SFPs. Maybe Brandon noticed something else but that is the only change I made. [22:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:31:22] !log tgr@tin Synchronized php-1.27.0-wmf.4/includes/MagicWord.php: T117066, fixes some exceptions in production (duration: 00m 17s) [22:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:32:04] jouncebot: refresh [22:32:06] I refreshed my knowledge about deployments. [22:33:37] !log tgr@tin Synchronized php-1.27.0-wmf.4/extensions/ZeroBanner/includes/ZeroSpecialPage.php: T116821 (duration: 00m 17s) [22:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:33:48] yurik: ^ [22:34:17] tgr, thx, checking... [22:34:19] any point in deploying wmf3 now? [22:34:43] (03CR) 10Alex Monk: Fixed getMWScriptWithArgs() user error message (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803 (owner: 10Aaron Schulz) [22:35:14] jhobs, ^^ [22:35:38] tgr, do we have any non-en wikis on it? 
[22:35:44] wikipedia [22:35:51] not other projects [22:36:15] the train for wikipedias was a couple hours ago [22:36:32] so unless some wiki is somehow managed separately, no [22:37:57] yurik: confused, what's the question? [22:38:04] looks like it's been deployed? [22:38:37] the wmf4 backport has been deployed, unless ZeroBanner does something unusual that should cover all wikis [22:39:33] the deployments page mentioned wmf3, just wanna make sure I'm not missing anything [22:40:00] I tested with enwiki and everything seems fine [22:40:08] and it should work the same across all wikis [22:40:11] so thanks tgr! [22:41:38] jouncebot: refresh [22:41:40] I refreshed my knowledge about deployments. [22:43:25] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1767082 (10Dzahn) I see you also got on stat1002: stat1002 sshd[17449]: Starting session: shell on pts/16 for atgomez So claiming it's resolved again. Right? 
[22:43:34] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1767083 (10Dzahn) 5Open>3Resolved [22:48:59] (03PS1) 10Rush: Allocate reserved labs-hosts1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/249914 [22:53:25] jhobs, yes, it has been deployed, i'm still testing, tgr, up to you really - this code affects wikipedias only [22:53:46] jhobs, update, it works fine [22:54:19] PROBLEM - puppet last run on mw2027 is CRITICAL: CRITICAL: puppet fail [22:54:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 23.08% of data above the critical threshold [500.0] [22:54:31] yurik: yeah, scroll up, I already tested :) [22:58:20] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures [22:59:55] (03PS2) 10Rush: Allocate reserved labs-hosts1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/249914 [22:59:57] (03PS1) 10Rush: Labs instance subnet allocation [dns] - 10https://gerrit.wikimedia.org/r/249919 [22:59:57] ori: https://gerrit.wikimedia.org/r/#q,249912,n,z https://gerrit.wikimedia.org/r/#q,249913,n,z [23:00:04] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151029T2300). Please do the needful. [23:00:04] AaronSchulz ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:25] I'll do it [23:00:34] Wait, I put my patch in the wrong one [23:00:43] :) [23:00:54] RoanKattouw: I can do it if you like [23:00:58] Note my unaddressed comment on one of them.
[23:01:14] No, I just edit-conflicted and didn't resolve [23:01:20] * RoanKattouw fixes deployments page, then looks at Krenair's comment [23:01:21] Krenair: I just +2'd both of Erik's since I assumed they were fine; if you want to block please do now [23:01:32] (03PS2) 10Rush: Labs instance subnet allocation for Codfw [dns] - 10https://gerrit.wikimedia.org/r/249919 [23:01:33] I haven't reviewed Erik's ones [23:01:36] oh ok [23:02:22] They don't look controversial, and I'm not about to go nitpicking in CirrusSearch or CSS stuff. [23:03:05] yeah no worries, I misunderstood [23:03:09] Not going to do Aaron's, because Krenair is disputing its accuracy and I don't know who/what is right [23:03:18] Getting Uncaught TypeError: Cannot read property 'insertRule' of undefined on nl.wikipedia.org [23:03:23] coming from extension/core scripts [23:03:29] causing TOC collapse to be broken [23:03:33] great [23:03:38] I haven't done anything yet though [23:03:40] Aaron's change itself is probably fine, but I'd prefer we fixed up the rest of the comment instead of just forgetting by merging the change. [23:03:41] But I'll look at that next [23:03:51] E.g. https://nl.wikipedia.org/wiki/Huiskat [23:04:03] load.php?modules=Spinner.... TMH... [23:04:25] thedj: ----^^ [23:05:39] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:06:05] 581 error: Call to undefined method RunningStat::push() in /srv/mediawiki/php-1.27.0-wmf.4/includes/libs/Xhprof.php on line 259 [23:06:21] MaxSem: from when? [23:07:57] Yeah, taht's one fixed an hour or so ago [23:08:02] (should be ) [23:08:29] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1767202 (10atgo) Rad. Thanks guys. I'd also like to be able to use Sequel Pro: sequelpro.com Any kernels of wisdom you could share about setting that up? Thanks! [23:09:48] indeed, latest one was from 21:19:28. 
it's just the log is now in a reasonable state so sees 2 hours worth of data [23:09:56] good job, everyone! [23:12:17] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1767208 (10Krenair) I suspect you're unlikely to get official support for it. Is the mysql CLI not good enough? I don't know about how access control works for... [23:12:41] *sigh* [23:13:04] ori: Why did you self-merge an unreviewed change into wmf4 at 4:01pm? [23:13:16] Now it's intertwined with the SWAT changes and I'm basically forced to deploy it [23:13:23] I offered to deploy it [23:13:30] and the other SWAT patches [23:13:32] Oh is that what you meant [23:13:37] yes [23:13:38] Right, OK [23:13:46] would you like me to? [23:13:54] Sorry, I'd forgotten you'd offered to do the SWAT [23:13:56] No, it's fine [23:14:02] I'm basically forced to deploy it [23:14:16] At some point I decided that in these cases, I'd revert instead. [23:14:47] Yeah, I think I would have done that if Ori hadn't been around and satisfactorily explained [23:15:01] yes, that is the correct response Krenair / RoanKattouw [23:15:19] Not going to get pressured into deploying something I don't understand. 
[23:15:20] the "be around" requirement is in effect even for roots :) [23:16:28] !log catrope@tin Synchronized php-1.27.0-wmf.4/resources/src/mediawiki.special/mediawiki.special.search.css: SWAT: styling tweaks for inline interwiki search (duration: 00m 18s) [23:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:16] !log catrope@tin Synchronized php-1.27.0-wmf.4/extensions/Scribunto/: Make the percentile threshold for slow function stats configurable (duration: 00m 18s) [23:17:20] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1767235 (10EBernhardson) setup an ssh tunnel, point sequel pro at the ssh tunnel. thats about all there is to it. [23:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:06] !log catrope@tin Synchronized php-1.27.0-wmf.4/extensions/Flow: Fix CAPTCHA rendering in RTL languages (duration: 00m 19s) [23:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:01] !log catrope@tin Synchronized php-1.27.0-wmf.4/extensions/CirrusSearch/: Fix unwritable cluster errors (duration: 00m 19s) [23:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:46] OK all done [23:21:35] !log radium: scheduled downtime, reinstalling [23:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:56] thanks RoanKattouw [23:23:29] RECOVERY - puppet last run on mw2027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:24:29] !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/Scribunto/engines/LuaSandbox/Engine.php: I69e9218 (duration: 00m 18s) [23:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:08] grmbl, the Flow thing has a bug, I'll follow up [23:25:15] 
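EBernhardson's Sequel Pro answer above ("setup an ssh tunnel, point sequel pro at the ssh tunnel") can be sketched as below. The database hostname and local port are placeholders, not taken from the log:

```shell
# Forward local port 3307 through the bastion to the MySQL port on a
# (hypothetical) analytics database host, and keep the tunnel open:
ssh -N -L 3307:analytics-db.example.eqiad.wmnet:3306 bast1001.wikimedia.org

# Then configure Sequel Pro as a plain MySQL connection to
# host 127.0.0.1, port 3307.
```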
https://en.wikipedia.org/wiki/MediaWiki:Common.css 503 [23:25:30] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:25:44] CSS/JS pages are 503ing [23:25:59] i think it's me, sec [23:26:34] yes [23:26:46] fixed. headdesk. [23:28:49] (03PS1) 10EBernhardson: Initial cirrus configuration for language detection of search terms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249922 [23:29:16] can i fit one more into swat (can deploy it myself)? its a mediawiki-config patch to allow usage of the language detection in search (only via a special query string for the moment) [23:29:49] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [23:30:05] grumble [23:30:21] (03PS2) 10EBernhardson: Initial cirrus configuration for language detection of search terms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249922 [23:31:18] ebernhardson: Go for it [23:31:28] Then after that I need to deploy a fix for what I just broke [23:31:29] (03CR) 10EBernhardson: [C: 032] Initial cirrus configuration for language detection of search terms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249922 (owner: 10EBernhardson) [23:31:33] (03Merged) 10jenkins-bot: Initial cirrus configuration for language detection of search terms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249922 (owner: 10EBernhardson) [23:32:04] !log Restarted logstash on logstash1003; died with OOM error [23:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:32:48] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: https://gerrit.wikimedia.org/r/249922 (duration: 00m 17s) [23:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:18] sweet! 
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=Jesenwang+flugplatz&fulltext=Search&cirrusAltLanguage=yes [23:34:41] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [500.0] [23:37:56] A couple of people reported issues briefly FYI [23:38:04] from -commons: Who killed Commons. [23:38:21] from -ops: Error: 503, Service Unavailable at Thu, 29 Oct 2015 23:25:20 GMT [23:38:37] from wikipedia-en: 18<McMatter> Who broke Wikipedia? [23:38:38] etc. [23:39:21] umm, Call to a member function getCPUUsage() on a non-o... [23:39:27] what patch to be undeployed? [23:39:34] RoanKattouw: see that in any of the ones you deployed? [23:39:38] I vaguely recall ori mentioning that function recently? [23:39:44] ori: getCPUUsage ? [23:39:55] Yeah I remember that too [23:40:00] My patch was JS only [23:40:01] yeah that was my patch earlier [23:40:16] i reverted it [23:40:34] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [23:40:40] ie: it's over [23:40:57] icinga is very slow in this regard [23:41:03] oh ok : [23:41:04] i'll write it up [23:41:18] thank you ori [23:41:22] i think it qualifies, sadly [23:42:44] 18<wctaiwan> so just now I got a BadMethodCall exception at Special:Contributions, but it seemed transient. [23:42:44] 18<wctaiwan> in case that's something folks want to know about [23:42:47] from -tech [23:42:56] (03PS1) 10Ori.livneh: Disable $wgScribuntoGatherFunctionStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249927 [23:43:01] That wouldn't be something Flow-related would it RoanKattouw? 
[23:43:19] (03CR) 10Ori.livneh: [C: 032] Disable $wgScribuntoGatherFunctionStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249927 (owner: 10Ori.livneh) [23:43:40] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:43:42] (03Merged) 10jenkins-bot: Disable $wgScribuntoGatherFunctionStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249927 (owner: 10Ori.livneh) [23:44:24] Ah, it's the same Scribunto error [23:44:26] !log ori@tin Synchronized wmf-config/CommonSettings.php: Ibac0d60bd: Disable ScribuntoGatherFunctionStats (duration: 00m 17s) [23:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:34] Why is Scribunto being called on Special:Contributions? [23:44:47] someone use a module in one of the messages there perhaps? [23:46:49] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:47:06] Argh the Jenkins queue is taking forever [23:47:12] because of a failing commit [23:47:39] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:12] another report: https://meta.wikimedia.org/w/index.php?diff=14379877&oldid=14350879&rcid=6950872 [23:54:50] also old [23:55:24] Who do I go to for replacing the english wikipedia logo for a day? [23:57:09] Cyberpower678, any deployer can change the logo [23:57:14] if they know how [23:57:28] did they reach consensus about that? [23:57:29] if it's for a day, wouldn't a CSS override via MediaWiki:Common.css be better? [23:57:38] https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/5_millionth_article_logo [23:57:48] It could also be 2 days? [23:57:58] if you think that's acceptable for a wiki enwiki's size, ori.... 
and they don't care about what shows on the locked down pages [23:58:41] we should probably weigh in [23:58:49] just a sec, let me look at the logo [23:58:59] (if you need a config change) step 1) get somebody to make a gerrit change anytime step 2) get it on the swat deploy calendar beforehand so you know the timing is right
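The Common.css override ori floats above would look roughly like this in the Vector skin of the time; the selector targets the sidebar logo, and the image URL is illustrative, not the real uploaded file:

```css
/* Temporary 5-millionth-article logo; revert after the event.
   The URL below is a placeholder for the actual Commons upload. */
#p-logo a {
    background-image: url(//upload.wikimedia.org/wikipedia/commons/temp-logo.png) !important;
}
```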