[00:01:02] !log Tried to update scap to 1879fd4 (Add sync-l10n command for l10nupdate); trebuchet reported 0/483 minions completing fetch and 3/483 minions completing checkout [00:01:06] where is it actually defined? [00:01:19] * ori is having a hard time locating the redis instance def for trebuchet [00:01:55] * yuvipanda waits for ori to find out that this was using rdb1001 [00:02:06] no, it uses redis on tin [00:02:12] :) [00:02:45] it's not in the trebuchet module for sure [00:02:53] is it anywhere? I can't find it [00:03:33] I would have expected it in deployment::deployment_server [00:04:07] there are two ways to get a redis instance (afaik) -- redis::legacy, which is just the old redis class renamed, and redis::instance [00:04:19] maybe it's not puppetized [00:04:29] or in the trebuchet-trigger package? [00:05:13] nope [00:05:14] Depends: python (<< 2.8), python (>= 2.7), python-support (>= 0.90.0), python-git (>= 0.3.2.RC1) [00:06:45] oh, there it is [00:06:57] modules/role/manifests/deployment/server.pp [00:09:48] when I was debugging in beta cluster it looked like minions were being updated even when the returner wasn't reporting. I haven't poked about on prod to see if it has the same behavior [00:10:10] i tried deploying scap again just now [00:10:15] * bd808 really hates the magic in trebuchet [00:10:24] i got 2/483 completed fetch, 3/483 completed checkout [00:10:35] then again, and got 3/ 3/ [00:11:04] so it's just horribly flakey and unreliable [00:11:04] so maybe salt is just being that inconsistent? [00:11:07] yeah [00:13:42] bypassing salt as the transport and using dsh instead requires root I think to run 'salt-call' so I can't even work around it in prod [00:15:39] I think the releng plan to kill this is to package scap as a deb and then use the new deploy command to replace trebuchet everywhere [00:16:07] \o/ [00:17:28] interesting, i ran 'keys *' on redis on tin and there are a lot of keys that look like artifacts of past deployments [00:19:05] hmmm.. looks like wikidev is granted sudo rights for the necessary salt-call commands. Maybe I can figure out a hack [00:20:14] nope. the grant is only on the deploy server so that git-deploy can do what it does [00:20:35] what's the salt-call invocation again? [00:20:45] * ori runs MONITOR on redis while trying to trebuchet-deploy scap [00:20:58] sudo salt-call deploy.fetch 'scap/scap' && sudo salt-call deploy.checkout 'scap/scap' [00:21:41] running that on any mw host should pull down the latest deploy tag form the master [00:30:32] bd808: running that [00:33:11] bd808: ok, i ran it twice on all machines that have scap as a deployment target [00:33:26] or rather, i told salt twice to run it on all those machines, as a way of improving the odds that they were all hit [00:34:30] thanks. I guess we will find out if it worked if and when we decide to deploy https://gerrit.wikimedia.org/r/#/c/255916/ [00:34:51] (03CR) 10Ori.livneh: [C: 032] l10nupdate: replace ssh key with new scap script [puppet] - 10https://gerrit.wikimedia.org/r/255916 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [00:42:06] * bd808 nervously waits for ori to tell him that the changes work [00:42:19] puppet ran fine on tin [00:42:38] did you restart keyholder-proxy? [00:42:42] don't we need to wait for it to run on all the appservers, too? [00:42:58] I did [00:43:16] all the puppet changes should be for tin only (well tin and mira) [00:43:46] I'll try updating a stale branch and see what happens [00:44:27] heh. 
I made it so you can't update a stale branch [00:45:36] !log testing l10n cache rebuild as l10nupdate user [00:46:19] our bot is awol again after tha last netsplit [00:46:47] morebots? [00:46:52] wonder if I can poke that [00:47:21] hm, nope, not on the maintainers list [00:48:07] legoktm, ^ [00:49:56] !log bd808@tin sync-l10nupdate completed (1.27.0-wmf.7) (duration: 04m 37s) [00:50:35] ori: your salt trick worked pretty well; only 2 hosts are missing the new version of the script [00:50:44] which ones? [00:50:44] mw1099.eqiad.wmnet and mw1127.eqiad.wmnet [00:52:00] Krenair: i added you [00:53:03] [INFO ] Executing command '/usr/bin/git fetch' in directory '/srv/deployment/scap/scap' [00:53:03] [ERROR ] Command '/usr/bin/git fetch' failed with return code: 128 [00:53:03] [ERROR ] output: error: object file .git/objects/08/d9b1b48167abd00af9a3f8b8aee473200d4ee7 is empty [00:53:05] fatal: loose object 08d9b1b48167abd00af9a3f8b8aee473200d4ee7 (stored in .git/objects/08/d9b1b48167abd00af9a3f8b8aee473200d4ee7) is corrupt [00:53:07] that's mw1099 [00:53:47] i deleted the loose object [00:53:51] worked ok after that [00:53:54] now mw1127 [00:54:24] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1837481 (10GWicke) The basic issue is that 1007-9 are of a smaller spec for historical reasons, but yet have the same storage weights assigned to... [00:54:56] mw1127 has a corrupt file too [00:56:39] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1837482 (10bd808) [00:57:04] nuked the repo entirely on mw1127 and reran deploy.fetch / deploy.checkout, which fixed it [00:57:09] got it [00:57:17] ori: nice [00:57:23] !log test [00:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:57:39] bd808: wanna run it one more time just to have that good feeling of a clean run? [00:57:47] ori: trivial change to the log message in https://phabricator.wikimedia.org/D66 [00:58:27] seems like the log should match the script name [00:58:49] accepted [00:59:32] I'm not sure why arc land is so slow... [00:59:47] * bd808 twiddles thumbs [01:01:39] thanks Krenair [01:02:16] yw [01:04:37] !log testing l10n cache rebuild as l10nupdate user (take 2) [01:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:05:36] !log bd808@tin sync-l10n completed (1.27.0-wmf.7) (duration: 01m 19s) [01:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:05:50] w00t. no errors [01:06:11] I guess we will see how it works with l10nupdate-1 in a few hours [01:06:55] bd808: thanks for taking care of that [01:07:16] It wasn't the project I expected to work on this weekend but it was kind of fun [01:08:25] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1837485 (10bd808) ``` tin:/srv/mediawiki-staging (git master $) bd808$ sudo -u l10nupdate -n -- sudo -... 
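For reference, the mw1099/mw1127 cleanup described above amounts to roughly the following on the affected minion; the repo path, the corrupt object name and the salt-call invocations are taken from the log, with the full re-clone as the fallback when deleting the loose object isn't enough:

```
# run on the affected minion (e.g. mw1099); illustrative sketch only
cd /srv/deployment/scap/scap

# "git fetch" failed because this loose object was empty/corrupt;
# removing it lets the next fetch pull it again from the deployment server
sudo rm .git/objects/08/d9b1b48167abd00af9a3f8b8aee473200d4ee7

# if the repo is too far gone, remove it entirely and let trebuchet re-clone
# (this is what was done on mw1127):
# sudo rm -rf /srv/deployment/scap/scap

sudo salt-call deploy.fetch 'scap/scap'
sudo salt-call deploy.checkout 'scap/scap'
```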
[01:10:02] the failure of trebuchet to surface git repository corruption issues has bitten us before -- recently, in the case of Stas [01:10:28] you have to look at the detailed report output and then decode the exit codes [01:10:30] you joked then that trebuchet requires root but actually Stas has root on those machines [01:10:34] it's not very nice [01:10:39] it's just really obscure, root or not [01:11:40] ok, i'm off, bye! [01:11:45] o/ [01:12:32] Krenair: https://phabricator.wikimedia.org/T114971 is probably unblocked for you with these new l10nupdate-1 changes [01:14:35] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: puppet fail [01:20:39] (03CR) 10Bmansurov: [C: 04-1] "-1 to get your attention." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [01:21:18] (03PS1) 10BryanDavis: Provision ~/.gitconfig file for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255952 (https://phabricator.wikimedia.org/T119746) [01:22:37] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 15User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1837487 (10bd808) a:3bd808 [01:44:05] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:56:27] !log started `nodetool cleanup` on restbase1002 to get rid of unnecessary data from earlier 1001 decommission attempt [01:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:00:10] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed [02:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:11:07] bd808, Reedy: another git pull of core failed [02:47:16] (03PS1) 10TTO: Enable $wgULSAnonCanChangeLanguage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255953 (https://phabricator.wikimedia.org/T58464) [03:06:27] (03PS1) 10TTO: Translate project namesapce for jbowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255954 (https://phabricator.wikimedia.org/T118067) [03:51:45] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [03:55:44] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [04:31:52] (03CR) 10Ori.livneh: [C: 032] Provision ~/.gitconfig file for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255952 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [05:31:35] PROBLEM - NTP on seaborgium is CRITICAL: NTP CRITICAL: No response from NTP server [05:35:25] PROBLEM - Labs LDAP on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:36:05] PROBLEM - salt-minion processes on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:37:06] PROBLEM - dhclient process on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
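The same deploy.fetch/deploy.checkout pair can also be driven from the salt master for every target at once, which is roughly what "told salt twice to run it on all those machines" amounts to. A sketch, with the grain-based targeting being an assumption (the log only says the run covered all hosts that have scap as a deployment target):

```
# on the salt/deployment master; the targeting expression is an assumption
sudo salt -G 'deployment_target:scap/scap' test.ping | grep -c True   # how many of the 483 answer at all
sudo salt -G 'deployment_target:scap/scap' deploy.fetch 'scap/scap'
sudo salt -G 'deployment_target:scap/scap' deploy.checkout 'scap/scap'

# "decode the exit codes": the structured return is easier to read as JSON
sudo salt -G 'deployment_target:scap/scap' deploy.checkout 'scap/scap' --out=json
```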
[05:50:16] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:01:45] PROBLEM - Disk space on mw1002 is CRITICAL: DISK CRITICAL - free space: / 8071 MB (3% inode=95%) [06:15:55] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:31:04] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:25] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:34] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:34] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:45] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: puppet fail [06:31:56] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:45] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:45] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:55] PROBLEM - Disk space on mw1002 is CRITICAL: DISK CRITICAL - free space: / 7933 MB (3% inode=95%) [06:38:59] !log Restarted statsv on hafnium [06:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:56:25] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:56:54] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:56:55] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:25] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:58:06] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:58:56] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:06] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:14] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:00:35] RECOVERY - Disk space on mw1002 is OK: DISK OK [07:13:26] (03CR) 10Reedy: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/255952 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [07:25:26] (03PS1) 10Reedy: Remove unused MWMULTIDIR variable [puppet] - 10https://gerrit.wikimedia.org/r/255957 [07:35:49] (03PS1) 10Reedy: Reduce code duplication [puppet] - 10https://gerrit.wikimedia.org/r/255958 [08:18:23] (03PS6) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [08:19:16] (03CR) 10jenkins-bot: [V: 04-1] etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [08:26:12] (03Abandoned) 10Muehlenhoff: Bump fd ulimit for gitblit [puppet] - 10https://gerrit.wikimedia.org/r/255681 (owner: 10Muehlenhoff) [09:04:40] addshore: oohh nice, thanks! 
(re: mw statsd patch) [09:07:09] No worries (: [09:18:16] 6operations, 7Graphite: 500 errors from graphite shouldn't be retried by varnish - https://phabricator.wikimedia.org/T119721#1837751 (10fgiunchedi) 5Open>3Resolved fixed by https://gerrit.wikimedia.org/r/#/c/255706/ [09:22:53] (03Abandoned) 10Filippo Giunchedi: graphite: add http referer ban capability [puppet] - 10https://gerrit.wikimedia.org/r/255695 (https://phabricator.wikimedia.org/T119718) (owner: 10Filippo Giunchedi) [09:26:50] 6operations, 10MediaWiki-extensions-UniversalLanguageSelector, 7I18n, 7Verified: ULS causes pages to be cached with random user language - https://phabricator.wikimedia.org/T43451#1837765 (10Nemo_bis) [09:26:55] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1837766 (10jcrespo) ``` jynus@db1046:/srv$ df -h | grep /srv /dev/mapper/tank-data 1.4T 1.3T 106G 93% /srv jynus@db1046:/srv$ du -h --max-depth=2 691G ./sqldata/log 119M ./sqldata/mysql... [09:32:18] !log removing old snapshots from db1046 [09:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:30] 6operations, 7Database: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#1837795 (10jcrespo) CC'ing Isart, that was interested on maybe helping with mysql metrics. [09:47:29] <_joe_> thcipriani|afk twentyafterfour ostriches https://github.com/pressly/sup kind of interesting :) [09:48:43] _joe_: looks neat. I want to make scap able to run arbitrary commands in parallel ... at least it's been on my wish-list [09:49:01] <_joe_> twentyafterfour: that's called "dsh" :P [09:49:19] <_joe_> twentyafterfour: but yeah, I was thinking the interesting thing would be to check their features [09:49:52] you mean like querying facter facts? [09:50:53] or what kind of features? [09:50:56] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [09:51:39] <_joe_> well, dsh facter -p | grep [09:51:39] <_joe_> :P [09:51:45] right [09:52:18] Yeah, will some of the fact data eventually be stored in etcd or is that beyond the scope of plans for etcd? [09:53:25] we already have smart log filtering, we could apply the same filter system to facter data, or other sources of machine classification data [09:53:38] <_joe_> completely beyond the scope [09:53:38] <_joe_> if we want to have querable puppet facts/etc we should probably add such an interface to servermon [09:54:08] <_joe_> (servermon.wikimedia.org, I'm unsure if you have access) [09:54:46] 6operations: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846#1837815 (10Aklapper) 5Open>3declined a:3Aklapper Declining as ops won't spend time on this (however if you have a patch ready in Gerrit for redirects please link to i... [09:55:13] I think only ops/root have servermon [09:55:23] well it would be neat if it was accessible somewhere without directly talking to all of the nodes, but we could still send the request to everyone and let the nodes that fail the filter just return a reject status and then ignore them [09:55:41] (03CR) 10Ori.livneh: "It would work on testwiki because testwiki responses are not cached in varnish. 
It would make more sense to target test2wiki, but then you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255953 (https://phabricator.wikimedia.org/T58464) (owner: 10TTO) [09:55:42] yeah, only ops have servermon access right now [09:55:48] Reedy: yeah, doesn't look like I have access [09:56:13] the idea is to open it to the public at some point [09:56:23] but first I have to hide all the private data [09:56:26] <_joe_> :) [09:56:30] and only display them to logged in users [09:56:37] _joe_: the current filtering just uses 'key == value' expressions: https://doc.wikimedia.org/mw-tools-scap/scap3/deploy_commands.html#deploy-log [09:56:48] sounds like effort, don't bother ;D [09:56:52] akosiaris: Any reason it couldn't be opened up to NDA etc? [09:57:22] Reedy: yes it has serial numbers displayed, warranty dates etc [09:57:25] <_joe_> twentyafterfour: seems nice [09:57:30] I have racktables access :P [09:57:36] <_joe_> twentyafterfour: how do you get your docs there? [09:57:38] <_joe_> nice! [09:57:40] But fair enough [09:57:44] that is not because of NDA though [09:57:52] No, true [10:00:55] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [10:01:13] !log performing schema change on db1046 (analytics master) [10:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:18] (03Abandoned) 10TTO: Enable $wgULSAnonCanChangeLanguage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255953 (https://phabricator.wikimedia.org/T58464) (owner: 10TTO) [10:04:58] (03CR) 10Ori.livneh: "Documented on https://wikitech.wikimedia.org/wiki/Test.wikipedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255953 (https://phabricator.wikimedia.org/T58464) (owner: 10TTO) [10:07:17] tto: I'm glad you agree. Thanks for poking at this, regardless. I think it would be a good feature to have so it's important that we keep trying. [10:08:18] ori: Thanks for your helpful comments. I too am a bit sad that we have never had proper multilingualism on wikis like commons. 
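Until servermon grows a query interface, the ad-hoc approach _joe_ alludes to above is just dsh plus facter; a minimal sketch, assuming a dsh group named mediawiki-installation exists on the host it is run from:

```
# -M prefixes each output line with the host name, -c runs hosts concurrently
dsh -g mediawiki-installation -M -c -- facter lsbdistcodename | grep -w trusty
```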
[10:09:08] _joe_: docs get published by a ci job [10:09:18] !log re-enabling cr2-eqiad:xe-5/2/0 and xe-5/2/1 [10:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:09:24] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [10:10:52] !log re-enabling OSPF over cr2-eqiad:xe-5/2/2 <-> cr1-ulsfo:xe-0/0/3.538 [10:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:57] (03PS1) 10Faidon Liambotis: Revert "Depool ulsfo, outage in progress" [dns] - 10https://gerrit.wikimedia.org/r/255967 [10:12:01] (03PS2) 10Faidon Liambotis: Revert "Depool ulsfo, outage in progress" [dns] - 10https://gerrit.wikimedia.org/r/255967 [10:13:21] (03CR) 10Faidon Liambotis: [C: 032] Revert "Depool ulsfo, outage in progress" [dns] - 10https://gerrit.wikimedia.org/r/255967 (owner: 10Faidon Liambotis) [10:15:04] !log reenable puppet on graphite1001 [10:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:45] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet last ran 2 days ago [10:17:45] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:19:24] PROBLEM - puppet last run on seaborgium is CRITICAL: Timeout while attempting connection [10:21:14] RECOVERY - dhclient process on seaborgium is OK: PROCS OK: 0 processes with command name dhclient [10:21:26] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.030 seconds response time [10:22:14] RECOVERY - salt-minion processes on seaborgium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:23:14] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:27:37] 6operations: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846#1837846 (10Peachey88) 5declined>3Open Reopening, We discussed this earlier elsewhere, and was pointed out to me, that we should either fix the broken redirects, Or my s... [10:27:55] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures [10:29:34] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 75, down: 0, dormant: 0, excluded: 1, unused: 0 [10:35:50] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1837868 (10zhuyifei1999) 5Open>3Resolved a:3zhuyifei1999 No further reports of this bug AFAIK. Bug presumably resolved. 
[10:36:10] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1837871 (10zhuyifei1999) a:5zhuyifei1999>3BBlack [10:37:03] (03PS1) 10Filippo Giunchedi: fix --program description [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/255970 [10:37:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] fix --program description [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/255970 (owner: 10Filippo Giunchedi) [10:50:53] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1837890 (10Denniss) 5Resolved>3Open [10:51:34] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1816360 (10Denniss) See https://commons.wikimedia.org/wiki/File:Jennifer_Winget_at_the_launch_of_Watch_Time%27s_magazine_11.jp... [10:52:05] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:56:05] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1837894 (10jcrespo) This is a list of the first record on db1046 for each table: ``` mysql -A -BN -h db1046 log -e "SELECT table_name FROM information_schema.columns WHERE column_name='ti... [10:59:30] (03PS1) 10Muehlenhoff: Allow configurable LDAP indices in openldap module [puppet] - 10https://gerrit.wikimedia.org/r/255973 [10:59:44] !log upgrade python-statsd to 3.0.1 in codfw [10:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:03:15] !log upgrade python-statsd to 3.0.1 in eqiad [11:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:34] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Puppet has 1 failures [11:08:04] PROBLEM - DPKG on db1029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:08:05] PROBLEM - DPKG on db1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:10:13] that's me ^ [11:10:25] PROBLEM - DPKG on mc1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:11:16] PROBLEM - DPKG on mc1011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:11:36] PROBLEM - DPKG on mc1012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:11:55] PROBLEM - DPKG on db1031 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:11:56] PROBLEM - DPKG on hydrogen is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:12:13] godog: you missed a down arrow too ;) [11:12:24] (03PS2) 10Muehlenhoff: Allow configurable LDAP indices in openldap module [puppet] - 10https://gerrit.wikimedia.org/r/255973 [11:12:32] heheh [11:15:25] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:15:54] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Puppet has 1 failures [11:16:47] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1837949 (10Peachey88) [11:17:04] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#530760 (10Peachey88) [11:18:18] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - 
https://phabricator.wikimedia.org/T50930#1837961 (10jcrespo) [11:18:29] (03PS1) 10Muehlenhoff: Add LDAP index for sudoUser [puppet] - 10https://gerrit.wikimedia.org/r/255978 [11:18:34] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:51] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#530760 (10jcrespo) [11:20:31] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1837970 (10ori) >>! In T119380#1830707, @jcrespo wrote: > I have just one question, when and who decides when new tables are to be created within a schema? At the moment it is done manuall... [11:21:34] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Puppet has 1 failures [11:23:45] RECOVERY - DPKG on hydrogen is OK: All packages OK [11:25:15] PROBLEM - puppet last run on mc1011 is CRITICAL: CRITICAL: Puppet has 1 failures [11:27:57] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1837983 (10Ricordisamoa) [11:31:56] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1837986 (10jcrespo) > If we get agreement on T119144, we could potentially drop the clientIp column (varchar(191)) from all tables. Dropping columns is not an investment work persuing. Par... [11:32:14] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:33:25] PROBLEM - DPKG on gadolinium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:33:45] PROBLEM - DPKG on db1058 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:34:05] PROBLEM - DPKG on mc1010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:34:25] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Puppet has 1 failures [11:34:37] ^ should be recovering shortly, next puppet run [11:35:15] PROBLEM - puppet last run on mc1013 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:47] dpkg: dependency problems prevent configuration of python-statsd [11:38:59] python-statsd depends on python:any (>= 2.7.1-0ubuntu2). 
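The DPKG criticals above are the half-configured python-statsd upgrade; the usual way out (and what was later run by hand on lvs1001) is to let dpkg finish and pull in anything missing:

```
dpkg --audit                 # list packages stuck half-installed/half-configured
sudo dpkg --configure -a     # finish configuring pending packages
sudo apt-get -f install      # only needed if a dependency is actually missing
```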
[11:39:48] jynus: yeah that's me, fixing ATM [11:40:02] ok, if you are on it, no issue [11:40:19] although it may have affected diamond [11:41:18] yup, while diamond is running it shouldn't bother if it disappears from the fs [11:41:26] RECOVERY - DPKG on db1001 is OK: All packages OK [11:42:54] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:43:04] PROBLEM - puppet last run on gadolinium is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:25] RECOVERY - DPKG on db1029 is OK: All packages OK [11:45:15] RECOVERY - DPKG on gadolinium is OK: All packages OK [11:45:24] RECOVERY - DPKG on db1031 is OK: All packages OK [11:45:24] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:45:55] RECOVERY - DPKG on mc1010 is OK: All packages OK [11:46:05] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:46:14] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:46:35] RECOVERY - DPKG on mc1011 is OK: All packages OK [11:46:55] RECOVERY - puppet last run on gadolinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:46:58] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:47:06] RECOVERY - DPKG on mc1012 is OK: All packages OK [11:47:06] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [11:47:45] RECOVERY - DPKG on mc1013 is OK: All packages OK [11:49:04] RECOVERY - puppet last run on mc1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:49:25] RECOVERY - DPKG on db1058 is OK: All packages OK [12:47:25] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000000.0] [12:50:19] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 15User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1838047 (10faidon) 1) tin has those two: ``` root@tin:~# cat /etc/sudoers.d/l10nupdate... [12:59:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0] [13:04:49] (03PS1) 10Zhuyifei1999: Toollabs bastion: install GNU automake [puppet] - 10https://gerrit.wikimedia.org/r/255988 (https://phabricator.wikimedia.org/T119870) [13:08:29] (03CR) 10Merlijn van Deen: [C: 031] Toollabs bastion: install GNU automake [puppet] - 10https://gerrit.wikimedia.org/r/255988 (https://phabricator.wikimedia.org/T119870) (owner: 10Zhuyifei1999) [13:08:34] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 16.00% of data above the critical threshold [100000000.0] [13:09:05] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [13:09:07] apergos: maint-announce is in a terrible state [13:09:22] did you triage it (with your clinic duty hat on)? 
[13:09:27] I better clean that up before the next person gets it [13:10:00] I usually sort that at the end of the week [13:10:16] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:55] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:26] PROBLEM - HHVM rendering on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.974 second response time [13:12:44] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full [13:13:54] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:12] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 15User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1838086 (10Reedy) For 1 and 2 - l10nupdate-sync looks like the new one as it's for a sc... [13:15:35] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:54] PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:14] PROBLEM - salt-minion processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:14] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:46] (03CR) 10Faidon Liambotis: [C: 04-1] "Copied and updated from https://gerrit.wikimedia.org/r/#/c/80973/" [dns] - 10https://gerrit.wikimedia.org/r/239072 (owner: 10Faidon Liambotis) [13:16:46] PROBLEM - nutcracker port on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:46] PROBLEM - dhclient process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:04] PROBLEM - Disk space on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:16] PROBLEM - configured eth on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:24] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:25] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:35] PROBLEM - Check size of conntrack table on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:35] PROBLEM - nutcracker process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:22:45] PROBLEM - HHVM processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:23:54] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:23:55] RECOVERY - DPKG on mw1147 is OK: All packages OK [13:24:10] 6operations, 10netops: add new saltmaster (neodymium) to network exceptions for labvirt* etc hosts - https://phabricator.wikimedia.org/T119512#1838094 (10faidon) p:5Triage>3Normal [13:24:31] 6operations, 10netops: add new saltmaster (neodymium) to network exceptions for labvirt* etc hosts - https://phabricator.wikimedia.org/T119512#1838096 (10faidon) a:3faidon [13:24:34] RECOVERY - nutcracker port on mw1147 is OK: TCP OK - 0.000 second response time on port 11212 [13:24:34] RECOVERY - dhclient process on mw1147 is OK: PROCS OK: 0 processes with command name dhclient [13:24:35] RECOVERY - HHVM processes on mw1147 is OK: PROCS OK: 6 processes with command name hhvm [13:24:44] RECOVERY - Disk space on mw1147 is OK: DISK OK [13:24:46] 6operations, 10netops: add new saltmaster (neodymium) to network exceptions for labvirt* etc hosts - https://phabricator.wikimedia.org/T119512#1838097 (10faidon) 5Open>3Resolved Done. [13:25:04] RECOVERY - configured eth on mw1147 is OK: OK - interfaces up [13:25:05] RECOVERY - SSH on mw1147 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [13:25:14] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed [13:25:15] RECOVERY - Check size of conntrack table on mw1147 is OK: OK: nf_conntrack is 0 % full [13:25:15] RECOVERY - nutcracker process on mw1147 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:26:25] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [13:29:55] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [13:33:00] 6operations: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846#1838102 (10Aklapper) a:5Aklapper>3None [13:36:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0] [13:41:15] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:46:42] (03CR) 10Nikerabbit: [C: 04-1] CX: Use ContentTranslationRESTBase (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [13:47:04] (03PS2) 10Filippo Giunchedi: diamond: send statsd metrics in batches [puppet] - 10https://gerrit.wikimedia.org/r/254873 (https://phabricator.wikimedia.org/T116033) [13:47:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] diamond: send statsd metrics in batches [puppet] - 10https://gerrit.wikimedia.org/r/254873 (https://phabricator.wikimedia.org/T116033) (owner: 10Filippo Giunchedi) [13:51:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [13:51:45] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [13:52:24] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [13:52:40] !log updating varnishkafka on cp1065 [13:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:15] RECOVERY - 
HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:59:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [13:59:45] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [14:04:51] thanks for the fixup for neodymium, para void [14:05:16] (03PS6) 10KartikMistry: CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) [14:07:14] (03PS1) 10Filippo Giunchedi: diamond: batch statsd metrics in production [puppet] - 10https://gerrit.wikimedia.org/r/255993 (https://phabricator.wikimedia.org/T116033) [14:07:23] !log upgrading varnishkafka package on all caches [14:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:31] (03PS1) 10Faidon Liambotis: reprepro: add an experimental component [puppet] - 10https://gerrit.wikimedia.org/r/255994 (https://phabricator.wikimedia.org/T119519) [14:08:32] apergos: ^ anything to handoff ? [14:08:47] ah [14:08:49] hi godog [14:09:26] hey [14:10:26] (03PS2) 10Faidon Liambotis: reprepro: add an experimental component [puppet] - 10https://gerrit.wikimedia.org/r/255994 (https://phabricator.wikimedia.org/T119519) [14:10:34] there's an access request which we need to discuss at the ops meeting I guess [14:10:38] and I have one pending to do [14:10:42] nothing for you! [14:11:11] apergos: ok, thanks! [14:11:18] yw [14:12:38] (03CR) 10Faidon Liambotis: [C: 032] reprepro: add an experimental component [puppet] - 10https://gerrit.wikimedia.org/r/255994 (https://phabricator.wikimedia.org/T119519) (owner: 10Faidon Liambotis) [14:13:34] PROBLEM - DPKG on lvs1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:13:36] PROBLEM - DPKG on lvs1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:14:06] PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:14:13] I think that's you godog [14:14:16] confusingly enough [14:14:24] since I just changed lvs' apt config :) [14:14:49] paravoid: ah! 
looking [14:15:25] RECOVERY - DPKG on lvs1002 is OK: All packages OK [14:15:27] I think it's ok now [14:15:33] it was just upgrading or something [14:15:41] indeed, I just launched dpkg -a --configure on 1001 and it worked [14:15:53] no it's me heh [14:16:04] but same solution either way [14:16:06] RECOVERY - DPKG on lvs1001 is OK: All packages OK [14:16:07] oh ok :) [14:16:25] I keep forgetting about lvs100[123] still being precise when I do lvs* commands :P [14:16:53] which, they probably shouldn't be anyways at this point, just got scared off of finish the upgrade by some log messages before [14:17:41] (03PS2) 10BBlack: webrequest: Add X-Client-IP -> client_ip [puppet] - 10https://gerrit.wikimedia.org/r/253472 (https://phabricator.wikimedia.org/T118557) [14:19:28] (03CR) 10BBlack: [C: 032] "vk 1.0.7 deployed to address the config line length limit" [puppet] - 10https://gerrit.wikimedia.org/r/253472 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [14:21:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [14:21:25] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [14:25:17] <_joe_> I should probably raise that threshold a bit [14:26:27] (03PS7) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [14:26:29] (03PS1) 10Giuseppe Lavagetto: etcd: add client configuration facility [puppet] - 10https://gerrit.wikimedia.org/r/255998 [14:27:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [14:27:25] RECOVERY - DPKG on lvs1003 is OK: All packages OK [14:28:51] 6operations, 10Analytics, 10Traffic, 5Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1838151 (10BBlack) With varnishkafka-1.0.7 deployed and the patch above merged, the webrequest stream now has a correct "client_ip" field that analytics... [14:29:24] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [14:29:58] 6operations, 10Analytics, 10Traffic, 5Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1838153 (10BBlack) (and, I just read @ottomata's comment above - we can certainly switch the data into "ip" instead of "client_ip". That might be simple... [14:31:37] !log switching traffic from lvs3002 to lvs3004; upgrading lvs3002's kernel [14:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:07] 6operations, 10Analytics, 10Traffic, 5Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1838154 (10JAllemandou) No problem for me to remove and reuse ip, and remove x_forwarded_for :) [14:34:00] bblack: We need to look at updating the MW TrustedXFF extension [14:34:14] PROBLEM - pybal on lvs3002 is CRITICAL: Timeout while attempting connection [14:34:15] PROBLEM - puppet last run on lvs3002 is CRITICAL: Timeout while attempting connection [14:34:28] Reedy: ok, it looks at XFF directly I guess? 
[14:35:13] Reedy: re the stuff in the ticket above, we'd be removing xff data from an analytics pipeline, but the header is still there unchanged in the actual app requests [14:35:44] paravoid was going to file a task about updating MW, but not sure if he got round to doing it [14:35:58] I did not, I was waiting to coordinate w/ Brandon first [14:36:25] so TrustedXFF is the one with the huge and probably outdated proxy lists I guess [14:37:00] heh, yeah [14:37:24] we should probably transition to maintaining similar but better data in the zero/meta database, with community help [14:38:04] RECOVERY - pybal on lvs3002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [14:38:05] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:38:08] which will get the new XCIP / XTP / etc headers working nicer, and then swap out TrustedXFF for just some MW code/extension that says "trust this header name as the real client IP, don't decode XFF" [14:38:17] !log switching traffic back to lvs3002 [14:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:33] https://github.com/wikimedia/mediawiki-extensions-TrustedXFF/commits/master/trusted-hosts.txt [14:38:37] That was last updated Jan 2014 [14:39:05] bblack: (CPU increase) http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=vl100-eth0.lvs3002.esams.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=LVS+loadbalancers+esams [14:39:13] ttp://meta.wikimedia.org/wiki/XFF_project [14:39:41] Reedy: https://phabricator.wikimedia.org/T89838 is about moving the database we're using in varnish now from a private wiki for zero to meta.wm.o where we can get broader help maintaining it [14:39:42] disappointing isn't it [14:40:18] yeah but, within acceptable limits all things considered, I think [14:40:24] yeah sure [14:40:26] would be nice to get some comparison profile data and see what it is [14:41:00] (03PS1) 10ArielGlenn: add matmarex to researchers, statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) [14:41:12] yeah I tried, we don't have perf built for 4.3 yet :) [14:41:27] linux-tools-foo [14:42:40] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1838166 (10zhuyifei1999) >>! In T119038#1837890, @Denniss wrote: > See https://commons.wikimedia.org/wiki/File:Jennifer_Winget... [14:43:12] PROBLEM - MariaDB Slave Lag: s7 on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:43:14] (03PS1) 10BBlack: webrequest: move client_ip data to legacy "ip" field [puppet] - 10https://gerrit.wikimedia.org/r/256002 (https://phabricator.wikimedia.org/T118557) [14:44:15] PROBLEM - puppet last run on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:44:31] PROBLEM - MariaDB Slave IO: s7 on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:44:51] checking, there is a load problem there [14:45:14] PROBLEM - RAID on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
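The log doesn't show the exact per-host commands behind "switching traffic ... upgrading the kernel", but based on the pybal/puppet remarks later in the conversation the per-host procedure is roughly shaped like this (the kernel package name is an assumption; the backup LVS keeps advertising the service routes at lower preference and takes the traffic while the primary is down):

```
# on the primary being upgraded, e.g. lvs3002 -- sketch only
sudo puppet agent --disable 'kernel upgrade'     # puppet would otherwise restart pybal
sudo service pybal stop                          # withdraw BGP routes; backup takes over
sudo apt-get install linux-image-4.3.0-1-amd64   # hypothetical package name from the experimental component
sudo reboot
# after the reboot:
sudo service pybal start                         # take the traffic back
sudo puppet agent --enable
```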
[14:45:21] !log switching traffic from lvs3001 to lvs3003; upgrading lvs3001's kernel [14:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:40] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1838168 (10ArielGlenn) Hmmm looking at the staff and contractors page, it looks like @Wwes is the manager (please correct me... [14:46:14] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [14:46:24] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures [14:46:42] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1838171 (10ArielGlenn) Well let's discuss it today at the ops meeting and assuming no one minds then it will be on the record for the future too. [14:48:15] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 88.45 ms [14:48:31] RECOVERY - MariaDB Slave IO: s7 on db1034 is OK: OK slave_io_state Slave_IO_Running: Yes [14:50:42] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1838173 (10BBlack) Regardless, even if the bug appears "resolved" now (at least, back to normal rare levels of occurrence), my... [14:51:04] RECOVERY - RAID on db1034 is OK: OK: optimal, 1 logical, 2 physical [14:51:10] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1838174 (10Denniss) geographic location may play a role here but then we would have a data sync issue somewhere. The thumb fro... [14:53:12] RECOVERY - MariaDB Slave Lag: s7 on db1034 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [14:57:52] !log switching traffic from lvs4001 to lvs4003; upgrading lvs4001's kernel [14:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:44] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:24] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 79.93 ms [15:02:34] !log switching traffic back to lvs4001 [15:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:58] !log switching traffic from lvs4002 to lvs4004; upgrading lvs4002's kernel [15:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:48] (03PS1) 10Isart: adding diamond collector to send P_S metrics to graphite [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256007 [15:07:36] 6operations, 10ops-ulsfo: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#1838196 (10faidon) This happened just now, after a reboot: ``` Nov 30 15:04:10 lvs4001 kernel: [ 169.562309] CPU7: Core temperature above threshold, cpu clock throttled (total events = 1) Nov 30 15:04:10... [15:09:24] 6operations, 7Database: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#1838198 (10Isart) I've added the following Diamond collector to send P_S metrics to graphite. Not sure how to set the user/pass, so I've set it to `$user` and `$password` on the co... 
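When an appserver wedges the way mw1114/mw1147 did (HHVM stuck in __pthread_cond_wait, as in the restarts logged just below), it is worth grabbing a backtrace before bouncing the service so the stack survives for a task; a rough sketch:

```
PID=$(pgrep -o -f /usr/bin/hhvm)
sudo gdb -p "$PID" -batch -ex 'thread apply all bt' > /tmp/hhvm-stuck.bt
sudo service hhvm restart
```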
[15:10:03] <_joe_> !log restarting hhvm on mw1114, stuck in __pthread_cond_wait () [folly::EventBase::runInEventBaseThreadAndWait ()], apparently blocked in writing to stdout [15:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:44] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.083 second response time [15:11:55] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 70118 bytes in 0.574 second response time [15:12:03] <_joe_> !log restarting HHVM on mw1147 too, same reason as mw1114 [15:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:10] (03CR) 10Reedy: [C: 04-1] "<%= @password_variable %>" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256007 (owner: 10Isart) [15:13:50] !log switching lvs2001/2/3 traffic to lvs2004/5/6 and upgrading kernels [15:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:55] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:44] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 48.96 ms [15:21:33] !log switching lvs2004/5/6 traffic back to lvs2001/2/3 [15:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:11] (03CR) 10Alex Monk: [C: 04-1] "See my first comment on the task about groups" [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [15:25:15] 6operations: Build Linux 4.3 for jessie-wikimedia - https://phabricator.wikimedia.org/T119519#1838214 (10faidon) [15:25:32] PROBLEM - LVS HTTPS IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:26:10] PROBLEM - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:26:20] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:26:23] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#1838219 (10faidon) [15:26:24] 6operations: Build Linux 4.3 for jessie-wikimedia - https://phabricator.wikimedia.org/T119519#1838217 (10faidon) 5Open>3Resolved I created a component called "experimental" for suite jessie-wikimedia and included both linux and firmware-linux in there. Role lvs::balancer is already configured to include it,... [15:26:59] paravoid: ^ ? [15:27:20] PROBLEM - LVS HTTP IPv6 on misc-web-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:27:25] maybe pybal started up badly again? [15:27:27] PROBLEM - LVS HTTPS IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:27:29] paged as well fyi [15:27:39] (03PS1) 10Isart: fixing user/pass on MySQL_PS template [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256008 [15:27:57] only ipv6 hmmm [15:28:15] PROBLEM - LVS HTTP IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:28:33] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#1838221 (10faidon) lvs2001/2/3, lvs3001/2, lvs4001/2, i.e. all primaries, were upgraded to the new kernel. The backups were left with 3.19 on purpose, as an easy fallback in case something goes wrong with t... 
[15:28:38] err [15:28:44] what the hell [15:29:02] I'll stop pybal while we investigate [15:29:31] done [15:29:35] RECOVERY - LVS HTTP IPv6 on misc-web-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 334 bytes in 0.078 second response time [15:29:37] !log stopping pybal on lvs2001/2/3 [15:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:42] RECOVERY - LVS HTTPS IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 972 bytes in 0.168 second response time [15:29:53] I still haven't gotten *any* pages [15:30:30] RECOVERY - LVS HTTPS IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10176 bytes in 0.283 second response time [15:30:31] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1838223 (10BurritoBazooka) For me, the first one linked above in zhuyifei's comment looks like this: {F3029094} Thumbnail is... [15:30:49] RECOVERY - LVS HTTP IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 591 bytes in 0.077 second response time [15:31:01] paravoid: it seems aql ran out of credits from 500 to 0 in a single weekend. [15:31:09] RECOVERY - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15077 bytes in 0.357 second response time [15:31:11] * robh is making breakfast but is also watching irc [15:31:29] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 393 bytes in 0.162 second response time [15:31:34] so it seems keeping a 500 credit balance isn't enough (i just refilled it last week) [15:32:04] bblack: what do you mean by "started up badly"? [15:33:06] ipvsadm -L shows backends just fine [15:34:53] PROBLEM - pybal on lvs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:34:54] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:35:22] paravoid: i just topped off aql you'll start getting pages again [15:35:33] (03PS4) 10coren: Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) [15:35:35] i'll clear with mark a pre-approval to put more than 500 credits on there (500 seemed enough) [15:35:41] in the future [15:35:53] (03PS1) 10Muehlenhoff: openldap: Make slapd.conf 0440 [puppet] - 10https://gerrit.wikimedia.org/r/256009 [15:35:54] PROBLEM - pybal on lvs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:36:04] we ran on 500 credits for 2 months, and then for less than a week. [15:37:04] paravoid: I meant the issue we had before, which we patched I think, where it didn't configure some services [15:38:16] paravoid: https://gerrit.wikimedia.org/r/#/c/255555/4 splits out cleanup into a salt-invokable script instead [15:38:34] in any case, the services look right...
[15:40:03] I restarted it again, and stopped it again [15:40:13] I didn't lose a single ping [15:40:17] however, it does seem racy [15:40:24] [ 1310.977162] IPVS: sh: TCP 208.80.153.224:443 - no destination available [15:40:26] paravoid: the ifup race is still happening [15:40:27] (03CR) 10Jhobs: Disable QuickSurveys reader segmentation survey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [15:40:27] [ 1310.977383] IPVS: wrr: TCP 208.80.153.224:80 - no destination available [15:40:29] etc. [15:40:32] let me fix them... [15:40:59] bblack: yeah, it happened only in codfw, I was debugging this before [15:41:05] (03PS2) 10Jhobs: Disable QuickSurveys reader segmentation survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) [15:42:07] paravoid: fixed on lvs2001-3 [15:42:17] pybal is racy I think, it announces the route before it has actually configured realservers [15:42:24] (via "ifup eth1; ifup eth2", which were the two not showing RPS IRQ pattern correctly in /proc/interrupts) [15:42:53] RECOVERY - pybal on lvs2001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [15:44:01] paravoid: it used to always wait quite a while before it started advertising. maybe that came from some newer change... [15:44:52] RECOVERY - pybal on lvs2002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [15:45:26] basically you don't even have to look at /proc/interrupts, you can just run through "ifup ethX" for all the ethX. The ones that were already working right will say "ifup: interface eth2 already configured" [15:45:43] RECOVERY - pybal on lvs2003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [15:46:06] 6operations, 10Salt, 5Patch-For-Review: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#1838234 (10ArielGlenn) btw neodymium is jessie, as will be any other new syndics or masters. See T115287 [15:46:21] ok, now it all works.. [15:46:38] (03CR) 10BBlack: [C: 032] webrequest: move client_ip data to legacy "ip" field [puppet] - 10https://gerrit.wikimedia.org/r/256002 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [15:46:59] PROBLEM - LVS HTTPS IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:47:09] oh ffs [15:47:11] <_joe_> uh, what's up? [15:47:30] PROBLEM - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:47:39] (03Abandoned) 10BBlack: webrequest: remove "ip" field [puppet] - 10https://gerrit.wikimedia.org/r/253473 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [15:48:10] PROBLEM - LVS HTTP IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:48:17] it pings from here [15:48:20] _joe_, paravoid is having some "fun" with lvss on codfw [15:48:27] (03PS18) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:48:40] PROBLEM - LVS HTTPS IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is CRITICAL: No route to host [15:48:43] and 443 works too [15:49:03] why only ipv6? 
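A quick way to catch the ifup race described above on a multi-interface LVS host, and to poke an IPv6 service VIP the same way the icinga check does; the curl target and flags are illustrative (-k because the LVS service name may not be in the certificate):

```
# interfaces that came up cleanly answer "already configured";
# the racy ones get (re)configured by this loop
for i in eth0 eth1 eth2 eth3; do sudo ifup "$i"; done

# from a host outside codfw, hit the v6 VIP directly
curl -6 -skv -o /dev/null https://text-lb.codfw.wikimedia.org/
```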
[15:49:20] RECOVERY - LVS HTTPS IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10176 bytes in 0.265 second response time [15:49:39] <_joe_> paravoid: yeah I was looking at lvs2001 and it's seeing connections and everythign [15:49:56] I'm suspecting that only a portion of the backends is unreachable [15:49:59] RECOVERY - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15077 bytes in 0.316 second response time [15:50:08] (I stopped pybal again to investigate) [15:50:43] RECOVERY - LVS HTTP IPv6 on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 544 bytes in 0.079 second response time [15:50:44] it seems like ipv6 for basically all of the services doesn't work on the new boxes in codfw, for whatever reason [15:50:51] at least, from icinga's POV [15:51:01] RECOVERY - LVS HTTPS IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 971 bytes in 0.173 second response time [15:51:25] <_joe_> paravoid: I see why you think just one backend is reachable [15:51:38] and yet I can't confirm, it seems to work [15:51:43] for all rows [15:52:16] so, problem or not ? [15:52:48] so at this point, let's (a) fail out codfw in config-geo completely then (b) start up pybal and let the ipv6 errors persist so we can spend some time really looking before we switch back? [15:53:35] PROBLEM - pybal on lvs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:53:42] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1838252 (10ArielGlenn) Faidon fixed up the network issue for the labvirt hosts so they are now reachable from neodymium. Requested a new project in gerrit for our salt builds. [15:53:43] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:54:25] <_joe_> bblack: so pybal can't reach backends via ipv6? [15:54:28] paravoid: (also, I wonder if puppet has run on them all since the ifup fixes, and whether that affects something not obvious) [15:54:34] PROBLEM - pybal on lvs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:54:36] _joe_: not sure yet [15:54:42] I've disabled puppet, because puppet starts pybal [15:54:47] yeah I know [15:54:48] (and btw, I hate that) [15:54:56] it's configurable [15:55:19] well the BGP part, anyways [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151130T1600). Please do the needful. [16:00:05] jhobs kart_: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:16] yo [16:00:29] here [16:00:30] so counters for non-eth0 interfaces are 0 [16:00:40] sorry, this was too terse [16:01:03] ipvsadm connection counters for backends that are reachable via eth1/2/3 subinterfaces are 0 [16:01:17] hmmm [16:01:24] just on v6? 
[16:01:30] <_joe_> yes [16:01:30] but pybal reaches them just fine, which is why it's not depooling them [16:01:38] yup, just ipv6 [16:01:58] (03PS19) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [16:02:17] could still be related to the ifup race somehow [16:02:36] again, maybe after resolving the ifup race, puppet needs running again, or we need to ipvsadm -C while pybal is down, etc? [16:03:15] 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1838285 (10hashar) [16:03:42] manual pings go via the correct interface [16:03:47] I'm suspecting an ipvs bug :( [16:03:59] seems to be working fine in ulsfo/esams, but yeah they're single-interface [16:04:02] yup [16:04:30] it's worth checking if it's related to the ifup race, though [16:04:41] I don't see how? [16:04:47] jhobs: kart_ looking at SWAT stuff now. seeing a huge error blow up in logs... [16:04:58] yikes [16:05:04] :( [16:05:07] Function already defined: wmfLoadInitialiseSettings in /srv/mediawiki/wmf-config/CommonSettings.php on line 184 [16:05:08] oh [16:05:30] 7Blocked-on-Operations, 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1838293 (10hashar) I have complet the ports for Trusty and Jessie with: https://gerrit.wikimedia.org/r/256018 https://gerrit... [16:05:42] 7Blocked-on-Operations, 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1838295 (10hashar) [16:05:47] just trying to root out some explanation before continuing. [16:05:49] checking the ulsfo/esams 4.3 hosts to see if any of them had the single-interface version of the race (it does happen) [16:05:57] hrm [16:06:09] "lo" only has two service IPs [16:06:19] oh, we only have two configured heh [16:06:25] yeah [16:06:31] lots of IP trimming the past several months :) [16:06:49] yeah, I got confused by seeing a page long of servers, but it was just 2x(http, https) [16:07:10] (annoying how ipvsadm doesn't resolve ipv6) [16:07:15] wmfLoadInitialiseSettings function already defined started started right at midnight UTC yesterday looking at logstash. [16:08:39] all the 4.3 in esams/ulsfo seem to have ifup'd correctly, or you already manually fixed them [16:08:48] no, they ifup'ed correctly [16:08:53] I checked, though [16:09:37] anyways, I still think we should depool codfw in DNS if we want to debug this further. [16:09:52] either that or we revert kernels on codfw lvs200[123] and give up [16:10:13] (for now) [16:10:16] heh.. 
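The zeroed per-backend counters being discussed can be read straight from ipvsadm; a sketch, assuming a stock ipvsadm on the LVS hosts:

    # List every virtual service and its real servers with connection/packet/byte
    # counters; -n prints numeric addresses (ipvsadm doesn't resolve IPv6 anyway,
    # as noted above).
    ipvsadm -L -n --stats

    # The per-real-server rate view is sometimes easier to eyeball:
    ipvsadm -L -n --rate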
[16:10:18] well [16:10:26] I'll revert lvs2003's kernel in any case [16:10:34] (03CR) 10Reedy: [C: 04-1] "Please amend the previous change and then abandon this one :)" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256008 (owner: 10Isart) [16:11:04] oh that doesn't have a single IPv6 service IP [16:11:16] heh [16:11:22] I was about to say, geodns wouldn't depool it from internal services, so that needs a revert [16:11:29] but yeah, we also don't do ipv6 for internal services heh [16:12:03] for those following along on the log spam that's 3 orders of magnitude higher than anything else: https://phabricator.wikimedia.org/T119880 [16:12:43] also, FWIW, it seems to only be happening on mw1002 [16:12:51] huh [16:13:00] I have to run, bus ride almost over [16:14:04] bblack: I'll revert for now to get back our redundancy [16:14:11] ok [16:14:15] (03PS2) 10BBlack: webrequest: remove X-Forwarded-For [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) [16:14:35] goddammit Linux [16:14:49] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 15User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1838327 (10bd808) As far as I know, the l10nupdate user is only used to [[https://phabr... [16:14:53] 4.2 crashed, 4.3 consumes double the CPU *and* fails in strange ways [16:15:06] <_joe_> lol [16:15:25] it's clearly time to switch everything to AIX [16:15:34] PROBLEM - puppet last run on lvs2003 is CRITICAL: Timeout while attempting connection [16:16:54] Also, on mw1002, the specific line of code that seems to be throwing the error in CommonSettings.php seems to be the same as what is working on all other servers. Could be HHVM just needs a kick on that box. [16:17:30] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1838344 (10Milimetric) So there seem to be two threads here. Table level partitioning seems to me to complicate replication to the slaves and complicate application logic. It doesn't seem... [16:17:45] !log rolling back to kernel 3.19 on lvs2001/2/3 [16:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:03] PROBLEM - puppet last run on lvs2001 is CRITICAL: Timeout while attempting connection [16:18:03] PROBLEM - puppet last run on lvs2002 is CRITICAL: Timeout while attempting connection [16:18:20] (03PS1) 10BBlack: statsv: switch "ip" field to X-Client-IP like webrequest [puppet] - 10https://gerrit.wikimedia.org/r/256020 [16:19:27] (03PS2) 10BBlack: statsv: switch "ip" field to X-Client-IP like webrequest [puppet] - 10https://gerrit.wikimedia.org/r/256020 (https://phabricator.wikimedia.org/T118557) [16:19:49] 6operations, 10Analytics, 10Traffic, 5Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1838361 (10BBlack) Ok, the old "ip" field now has the X-Client-IP data in the webrequest logs. The remaining pending patches here are: the (updated) on... 
[16:20:15] bblack: btw, I suspect that by using "allow-hotplug eth2" instead of "auto eth2" these races may be fixed [16:20:33] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:34] worth testing :) [16:21:03] (which I can test easily on lvs1007-12 without touching any of the currently-live stuff) [16:21:17] awesome :) [16:21:53] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [16:22:33] RECOVERY - pybal on lvs2003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [16:22:52] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1838382 (10jcrespo) @Milimetric: deleting data will not solve immediately the problem, as deleting data logically doesn't mean space is freed from disk. Hence the partitioning suggestion.... [16:23:24] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 38.51 ms [16:23:32] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1838389 (10fgiunchedi) @gwicke the plan makes sense I think, I can start right away and there should space on the rest to accomodate the decomissi... [16:23:54] thcipriani: couldn't hurt [16:23:58] bd808: never having used it, is scap-hhvm-restart a thing that works/could be used to restart mw1002 without breaking everything? [16:24:31] thcipriani: eh... probably not [16:24:59] bd808: kk, do I just need an opsen for that then? [16:25:07] if the server already hosed? [16:25:55] kinda looks like it from fatalmontior. If you block just that server, fatalmonitor is totally readable. [16:26:13] without blocking logs from that server you see a ton of: Function already defined: wmfLoadInitialiseSettings in /srv/mediawiki/wmf-config/CommonSettings.php on line 184 [16:26:24] Oh! scap-hhvm-restart is the local only part. Yeah it is realtively safe [16:26:49] but getting a root to restart hhvm is probably safer [16:26:57] just in case things get wonky [16:27:10] !root [16:27:12] :) [16:27:17] ok, back with 3.19 and restored redundancy [16:27:25] (03CR) 10Bmansurov: Disable QuickSurveys reader segmentation survey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [16:27:30] what's up? [16:27:39] restart hhvm where? [16:27:41] could I get an opsen to do a restart on mw1002? Fatalmonitor looks like it's busted. [16:27:56] *fatalmonitor makes it looks like mw1002 is busted. [16:28:03] https://phabricator.wikimedia.org/T119880 [16:28:08] for graphs and such [16:28:16] the part of scap-hhvm-restart that doesn't work as hoped is the pybal depooling. Somedayℱ that will be replaced with some etcd magic [16:28:35] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1838414 (10jcrespo) BTW, I found the acceleration issue: the automatic purge process was failing since some tables had been deleted. [16:28:49] uhm [16:28:55] shotgun approach? [16:29:23] I'll restart.. 
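For reference, the allow-hotplug change paravoid suggests at the top of this stretch would look roughly like this in /etc/network/interfaces (the iface stanza body is illustrative; the real hosts carry their own addressing):

    # Before: ifupdown brings the interface up at boot and can race udev
    # auto eth2
    # iface eth2 inet manual

    # After: let udev trigger configuration once the device actually exists
    allow-hotplug eth2
    iface eth2 inet manual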
[16:29:54] 1.5M errors from one host in a hour seems like cause for a kick [16:30:08] !log mw1002 service hhvm restart [16:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:18] I disagree, but whatevs [16:30:24] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Update tag and racktables for holmium: rename to labservices1002. - https://phabricator.wikimedia.org/T119533#1838437 (10Andrew) [16:31:08] errors dropped to 0 [16:31:39] That "Function already defined" smells like hhvm cache problems [16:31:53] seems as though. thanks paravoid ! [16:32:04] "hhvm cache problems"? :) [16:32:15] now I can get back to SWAT :) jhobs kart_ still around? [16:32:37] thcipriani: yep, although I'm quickly making one tiny update to my patch [16:32:44] (removing a redundant array) [16:33:06] (03PS3) 10Jhobs: Disable QuickSurveys reader segmentation survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) [16:33:09] aaand done [16:33:35] (03CR) 10Jhobs: Disable QuickSurveys reader segmentation survey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [16:35:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [16:35:37] (03Merged) 10jenkins-bot: Disable QuickSurveys reader segmentation survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [16:36:00] thcipriani: I'm here too. [16:36:04] 6operations, 6Labs, 10wikitech.wikimedia.org, 7IPv6: Set IPv6 PTR for wikitech-static - https://phabricator.wikimedia.org/T103621#1838478 (10chasemp) p:5Triage>3Lowest [16:36:09] kart_: kk [16:36:15] thcipriani: thanks for following :) [16:36:34] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1838483 (10chasemp) p:5Triage>3Normal [16:36:45] 7Puppet, 6Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1838485 (10chasemp) p:5Triage>3Normal [16:38:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Disable QuickSurveys reader segmentation survey [[gerrit:255448]] (duration: 00m 28s) [16:38:45] ^ jhobs check please [16:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:48] looks good, thanks thcipriani! [16:39:57] jhobs: thanks for checking [16:40:03] kart_: kk, you're up :) [16:41:37] great! [16:43:13] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1838544 (10ArielGlenn) I see no one's tried to get onto stat1002 or anything yet; as soon as you folks verify... 
[16:43:46] 6operations, 6Labs: RDNS for 10.68.18.65 resolves to two different instances - https://phabricator.wikimedia.org/T115194#1838553 (10chasemp) p:5Triage>3High [16:44:04] 6operations, 6Labs, 10Labs-Team-Backlog: Make sure that the 'secret' repo in self hosted puppetmasters is back-upable - https://phabricator.wikimedia.org/T115177#1838556 (10chasemp) p:5Triage>3Low [16:44:32] 6operations, 6Labs: Cleanup / clarify labstore2001 - https://phabricator.wikimedia.org/T116972#1838562 (10chasemp) p:5Triage>3High [16:44:45] 6operations, 6Labs, 10Labs-Infrastructure: Unable to connect both redundant labstores to the shelves in parallel - https://phabricator.wikimedia.org/T117453#1838563 (10chasemp) p:5Triage>3High [16:45:09] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1838565 (10fgiunchedi) >>! In T112421#1817275, @Andrew wrote: > I have new backports ready for testing. Can I get a volunteer to install them on a beta-clu... [16:49:43] 7Puppet, 6Labs, 10MediaWiki-extensions-OpenStackManager, 7Documentation: doc.wikimedia.org puppet documentation for labs/lvm/srv.html gives a 404 - https://phabricator.wikimedia.org/T119329#1838597 (10chasemp) p:5Triage>3Lowest [16:51:25] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Update tag and racktables for holmium: rename to labservices1002. - https://phabricator.wikimedia.org/T119533#1838609 (10chasemp) p:5Triage>3Normal [16:51:50] 6operations, 6Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1838613 (10chasemp) p:5Triage>3High [16:51:58] !log thcipriani@tin Synchronized php-1.27.0-wmf.7/extensions/ContentTranslation/modules/draft/ext.cx.draft.js: SWAT: Add some extra information to save failure logging [[gerrit:255956]] (duration: 00m 28s) [16:52:01] ^ kart_ check please [16:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:04] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#1838615 (10faidon) There is an issue with blackholed cross-interface (i.e. traffic destined to eth1/2/3 realservers) IPv6 traffic. lvs2001/2/3 were rolled back to 3.19, pending further debugging... This is... [16:52:07] 6operations, 10Traffic, 5Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1838616 (10BBlack) a:3BBlack [16:52:21] 6operations, 10Traffic, 5Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825233 (10BBlack) p:5Triage>3Normal [16:53:26] thcipriani: okay! [16:54:49] thcipriani: also. we will get that 'extra info' in saving/restore failure of EventLogging. so, I can only 'really' check tomorrow :) [16:54:55] thcipriani: Thanks! [16:55:00] kart_: kk, thanks. [16:57:22] eh. in EL. 
not of EL :) [16:58:38] (03PS1) 10BryanDavis: Clean up l10nupdate settings (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/256025 (https://phabricator.wikimedia.org/T119746) [16:58:40] (03PS1) 10BryanDavis: Clean up l10nupdate settings (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) [17:03:43] PROBLEM - NTP on lvs2002 is CRITICAL: NTP CRITICAL: No response from NTP server [17:03:45] (03CR) 10BryanDavis: [C: 04-1] "I7729207ae17386c06a0382de72f98a4e023bb0ad makes fixing this correctly a lot easier. That patch removes the l10nupdate user from the normal" [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [17:06:45] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 15User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1838695 (10bd808) The changes in https://gerrit.wikimedia.org/r/256025 and https://gerr... [17:08:07] (03CR) 1020after4: [C: 031] mediawiki: specify uid 10002 for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [17:11:28] (03CR) 10Reedy: Clean up l10nupdate settings (2/2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [17:13:56] (03CR) 10Jhobs: "This is still gated by an opt-in flag within beta, right? And that option is disabled by default?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [17:15:44] (03CR) 10BryanDavis: Clean up l10nupdate settings (2/2) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [17:17:32] 7Puppet, 6Labs, 7Documentation: Missing documentation for labs puppet roles - https://phabricator.wikimedia.org/T91770#1838767 (10chasemp) p:5Triage>3Lowest [17:23:36] 6operations, 6Labs: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#1838802 (10chasemp) p:5Triage>3High [17:27:23] !log demon@tin Synchronized php-1.27.0-wmf.7/extensions/WikimediaMaintenance/: need maint script errywhere (duration: 00m 28s) [17:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:23] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1838846 (10Dzahn) has been approved in meeting just now [17:29:26] 6operations, 6Labs, 10wikitech.wikimedia.org, 7HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1838847 (10chasemp) p:5Triage>3Lowest [17:35:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [150.0] [17:35:35] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [17:38:35] (03PS20) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [17:40:12] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1838900 (10Wwes) Lindsey Anne 
Frankenfield will be his manager but she is just starting. @TrevorParscal is technically his... [17:43:34] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [17:43:50] 6operations, 6Labs, 10wikitech.wikimedia.org: Expand list of people who can create new Labs project - https://phabricator.wikimedia.org/T101688#1838923 (10chasemp) p:5Triage>3Low [17:44:43] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1838932 (10chasemp) p:5Triage>3High What's going on with this? Title makes it seems like no proper testing has been done. [17:46:00] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1838943 (10chasemp) So...should we close this as invalid or ? [17:46:42] I think the cond that __pthread_cond_wait () that HHVM is waiting on is an upgraded package ;) [17:47:03] 6operations, 6Labs: Icinga alert for labnet1001 for conntrack saturation graphite check - https://phabricator.wikimedia.org/T101980#1838954 (10chasemp) p:5Triage>3Normal [17:48:39] (03PS1) 10Alexandros Kosiaris: etherpad: Remove some old ignored settings [puppet] - 10https://gerrit.wikimedia.org/r/256032 [17:48:56] 6operations, 6Labs, 10Tool-Labs-tools-Other: Move geohack to production - https://phabricator.wikimedia.org/T102960#1838962 (10chasemp) p:5Triage>3Low [17:49:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [17:49:46] 6operations, 6Analytics-Kanban, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1838966 (10Milimetric) a:3Ottomata [17:50:16] 6operations, 6Labs: Have a cron job delete files that haven't been modified in the last X days / months in /data/scratch - https://phabricator.wikimedia.org/T103084#1838968 (10chasemp) p:5Triage>3Low [17:50:20] 6operations, 6Analytics-Kanban, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old {hawk} - https://phabricator.wikimedia.org/T118527#1838970 (10Milimetric) p:5Triage>3High [17:50:25] (03PS2) 10ArielGlenn: add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) [17:51:11] (03CR) 10ArielGlenn: "sorry about that. fixed." [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [17:58:51] <_joe_> twentyafterfour, thcipriani I'm probably be running late [17:59:02] _joe_: kk [17:59:04] <_joe_> ops meeting is running late [18:00:40] 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring: labstore monitoring: NRPE: Command 'check_cleanup-snapshots-labstore-state' not defined - https://phabricator.wikimedia.org/T111211#1839029 (10chasemp) 5Open>3Invalid a:3chasemp [18:02:27] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 15User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1839042 (10bd808) >>! In T119746#1836257, @Reedy wrote: > So this will fix it, but I do... 
[18:02:52] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1839054 (10coren) No, we still haven't done serious testing of that hardware (viz. T101471) so at best labstore1002 is a dubiously reliable backup atm. This task re... [18:04:43] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1839066 (10coren) >>! In T101741#1838932, @chasemp wrote: > What's going on with this? Title makes it seems like no proper testing ha... [18:07:21] (03PS3) 10Dzahn: contint: grant zuul-merger sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [18:10:10] (03CR) 10Dzahn: "re: journalctl. I amended this for consistency with other services, we tried before to limit this to a specific unit but that either doesn" [puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [18:11:11] (03PS4) 10Dzahn: contint: grant zuul-merger sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [18:12:44] (03CR) 10Dzahn: [C: 032] "was approved in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [18:14:17] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1839104 (10Dzahn) I amended the change to: [18:15:56] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: Remove CI root access from scandium - https://phabricator.wikimedia.org/T116921#1839111 (10Dzahn) [18:15:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1839109 (10Dzahn) 5Open>3Resolved a:3Dzahn [18:16:16] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1828516 (10Dzahn) a:5Dzahn>3Andrew [18:17:37] (03PS2) 10Dzahn: admin: remove CI root access from scandium [puppet] - 10https://gerrit.wikimedia.org/r/254130 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [18:18:00] 10Ops-Access-Requests, 6operations: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1839123 (10Dzahn) [18:18:52] (03CR) 10Dzahn: [C: 032] "this is the rarest thing. reverse access requests where the user says they don't need it anymore. thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/254130 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [18:22:33] (03PS4) 10Chad: Elastic: move merge_threads to hiera [puppet] - 10https://gerrit.wikimedia.org/r/207377 [18:24:01] (03CR) 10Chad: "PS4 separates the cleanup bit which should be able to stand on its own. Bumping the number of merge_threads in production may or may not b" [puppet] - 10https://gerrit.wikimedia.org/r/207377 (owner: 10Chad) [18:24:21] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#1839133 (10mcruzWMF) Hi, please add me as a project creator, as I have to work on projects across teams, and be able to make some of those pri... 
[18:26:49] (03PS7) 10Chad: Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 [18:29:45] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1839168 (10RobH) [18:30:51] 7Blocked-on-Operations, 6operations, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1839174 (10fgiunchedi) a:3Joe [18:31:06] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#1839176 (10Aklapper) >>! In T706#1839133, @mcruzWMF wrote: > please add me as a project creator, as I have to work on projects across teams, a... [18:31:38] (03CR) 10Chad: [C: 031] "https://puppet-compiler.wmflabs.org/1385/ works just fine, only change is where the merge_threads are being set, actual config output rema" [puppet] - 10https://gerrit.wikimedia.org/r/207377 (owner: 10Chad) [18:32:27] (03Draft1) 10Addshore: Allow wdqs on port 9999 for stat100[23] [puppet] - 10https://gerrit.wikimedia.org/r/256039 [18:33:52] Hi! Can anyone give me some insight about what mediawiki-config/dblsits/arbitraryaccess.dblist is about? [18:35:04] AndyRussG: at a guess its a list of wikis that have wikidata arbitrary acces enabled [18:35:09] AndyRussG, wikis in that list have wmgWikibaseEnableArbitraryAccess set to true [18:36:22] addshore: Krenair: hmm cool... K any idea of what performance/stability implications might be? I'm trying to figure out the possible level of risk of enabling Wikidata on Meta right before the big FR campaigns... Wikidata folks are pusing for it, but I'm hesitant... [18:36:32] https://gerrit.wikimedia.org/r/#/c/255063 [18:36:52] its stable and performs fine :) [18:37:02] AndyRussG, why are you worrying about this instead of them? [18:37:08] Not like it's your choice... [18:37:30] Heh well I told them we're in code freeze and they keep asking, so if I'm gonna argue for a no, I guess I have to justify it [18:37:58] * ostriches gets out some blowtorches to thaw the freeze :p [18:38:17] Meta db stores all the data about campaigns and banners, so if something takes that down, all our FR banners could go flbrprrrrzz [18:38:36] 6operations, 6Labs: Cleanup / clarify labstore2001 - https://phabricator.wikimedia.org/T116972#1839239 (10chasemp) [18:38:57] "freeze" [18:39:20] Or say if something locks up master for Meta db, it could hinder changes in the Admin UI for campaigns? [18:39:21] AndyRussG, well metawiki is on s7, are you also worried about frwiktionary and the wikipedias on that database? [18:39:31] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1839246 (10chasemp) 5Open>3Invalid a:3chasemp >>! In T119541#1833712, @akosiaris wrote: > I setup a new self hosted puppetmaster environment today and I did not meet this pro... [18:39:36] database cluster* [18:40:21] Krenair: I think wikipedias have had this for a while? It's just to activate this now on Meta and a few others [18:40:38] The only one I am looking at is Meta, since that's what controls banners across all our sites [18:40:40] AndyRussG: I mean, its already deployed on 589 or 892 wikis [18:40:49] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1839254 (10chasemp) >>! 
In T117673#1813406, @Joe wrote: > FTR. this just happened to me with a newly-created instance with jessie; to my knowledge no prior machine with that name existed and... [18:40:50] Right [18:41:40] I have just seen cases of extensions not playing nicely together. And the CentralNotice code that runs on Meta doesn't run anywhere else [18:41:43] addshore: ^ [18:42:51] it sounds like wikis in s7 already run this code, so if it took out a database master there stuff would break on meta already [18:42:59] (03PS21) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [18:43:15] That's 'cause CentralNotice runs in "infrastructure" mode on Meta but in "subscribing" mode everywhere else [18:43:36] Krenair: OK so that's good to hear. So sounds like db issues wouldn't be a concern [18:44:09] do you know how the wikis are grouped in terms of database hosting AndyRussG? [18:44:29] Krenair: no I'm not familiar with much of that, now [18:44:31] no [18:44:38] ottomata: do you know about the templating capabilities of systemd? [18:44:47] ori, not really no [18:44:51] i should read some more fancieness [18:45:02] i can explain quickly if you like [18:45:04] k [18:45:06] hit meh! [18:45:21] (jsut found this, am reading along... {errorobjecthere}, null, {errorobject2here}]) [18:45:22] oops [18:45:25] http://0pointer.de/blog/projects/instances.html [18:45:53] yeah [18:46:00] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: Remove CI root access from scandium - https://phabricator.wikimedia.org/T116921#1839276 (10Dzahn) Merging that change above did **NOT **remove the actual access by itself. This was needed as well: root@scandium:/etc/sudoers.d# rm contint-roots [18:46:00] Krenair: though also for Meta there are different database tables that don't exist elsewhere... Still, I suppose if the wikidata db code is stable, that probably shouldn't make any difference [18:47:21] I think it's a wikidata query service that's being enabled, so hopefully it's code that doesn't do any write queries? 
[18:47:22] AndyRussG, look at the sectionLoads sections of each of the wmf-config/db-*.php files [18:47:36] the keys are dblist file names [18:47:53] ottomata: so, tldr: you create, /lib/systemd/system/eventlogging-forwarder@.service , with an ExecStart=eventlogging-consumer @/etc/eventlogging.d/$i.conf [18:47:57] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: Remove CI root access from scandium - https://phabricator.wikimedia.org/T116921#1839281 (10Dzahn) 5Open>3Resolved [18:47:59] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1839282 (10Dzahn) [18:48:01] AndyRussG: well, now db schema changes are going to be made, so nothing to worry about there [18:48:03] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1839284 (10fgiunchedi) I tried adding `role::puppet::self` to an existing trusty host and looks like it has worked [18:48:42] values are arrays of database servers hosting that group of wikis (in which keys are hostnames and values are weight) [18:48:44] AndyRussG: also it is not the query service that is being enabled, but arbitrary access (which allows you to pull data from wikidata from items not directly linked to the page) [18:49:03] (03PS1) 10Yuvipanda: labstore: Kill NFS for orgcharts project [puppet] - 10https://gerrit.wikimedia.org/r/256042 (https://phabricator.wikimedia.org/T103137) [18:49:04] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1839292 (10Dzahn) 5Open>3Resolved the remaining blockers are closed. you can now *celebrate* @hashar [18:49:21] ottomata: then if you do systemctl start eventlogging-forwarder@legacy-zmq , systemd will load the unit file and replace '%i' with 'legacy-zmq' [18:49:35] ah coool [18:49:46] ottomata: next step is to get those things to run automatically. you do that with a symlink [18:49:53] that is nice...! 1 systemd file for all el daemons [18:50:15] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1839299 (10Dzahn) [18:50:15] and if they need a special override, can do with a specifically named @instance [18:50:20] /etc/systemd/system/multi-user.target.wants/eventlogging-forwarder@legacy-zmq.service -> /lib/systemd/system/eventlogging-forwarder@.service [18:50:21] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1179083 (10Dzahn) [18:50:41] yeah [18:51:19] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1839303 (10hashar) Thank you #operations ! 
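A minimal sketch of the template unit ori is describing above; the unit name and config path follow the conversation, while the binary location and restart policy are assumptions (and note the specifier is %i, not the $i typed above):

    # /lib/systemd/system/eventlogging-forwarder@.service
    [Unit]
    Description=EventLogging forwarder (%i)
    After=network.target

    [Service]
    ExecStart=/usr/bin/eventlogging-consumer @/etc/eventlogging.d/%i.conf
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

"systemctl start eventlogging-forwarder@legacy-zmq" then loads this one file with %i expanded to "legacy-zmq", and "systemctl enable" on the same instance name creates the multi-user.target.wants symlink shown above, so one unit file covers every eventlogging daemon.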
[18:51:20] i did that for redis with https://gerrit.wikimedia.org/r/#/c/253146/ , but then yuvipanda (wisely) changed it to one-service-file-per-instance so that it is easier to do upstart / systemd back-compat using service::unit [18:51:25] Hmmm [18:51:27] but for eventlogging it might be fine to go systemd only [18:51:44] yeah, we can drop backwards compat for that [18:51:46] hmmm [18:51:53] (03PS2) 10Yuvipanda: labstore: Kill NFS for orgcharts project [puppet] - 10https://gerrit.wikimedia.org/r/256042 (https://phabricator.wikimedia.org/T103137) [18:52:11] (03CR) 10Yuvipanda: [C: 032 V: 032] "DIEDIEDIEDIEDIEDIEDIEDIEDIDEIDEIDEIDEIDIE" [puppet] - 10https://gerrit.wikimedia.org/r/256042 (https://phabricator.wikimedia.org/T103137) (owner: 10Yuvipanda) [18:53:15] (03PS1) 10Addshore: WDQS add X-Served-By header to response [puppet] - 10https://gerrit.wikimedia.org/r/256043 (https://phabricator.wikimedia.org/T119508) [18:55:24] addshore: Krenair: K so if I understand correctly if master has an issue on a given DB cluster, it blocks master for all DBs on that cluster? [18:55:40] ori, off the top of your head, do you know of a puppetized base::service_unit that uses templates? [18:56:04] 6operations, 10RESTBase, 10procurement: Get some Samsung 850 Pro 1T spares - https://phabricator.wikimedia.org/T119659#1839321 (10RobH) The disk upgrades mentioned by @gwicke are for restbase1007, restbase1008, and restbase1009, all Dell R430s. [18:56:26] (03PS2) 10Addshore: Allow wdqs on port 9999 for stat100[23] [puppet] - 10https://gerrit.wikimedia.org/r/256039 [18:58:47] AndyRussG, maybe you should speak to jynus about this? [18:59:33] ottomata: nope [19:02:27] https://gerrit.wikimedia.org/r/#/c/253146 converted redis::instance to use templates [19:02:27] we changed it back afterwards [19:02:27] but if you check out that patch you'll see how it worked [19:02:27] k cool [19:02:27] danke [19:02:28] perfect, makes sense [19:02:28] Krenair: yeah good idea thanks! [19:02:28] (03PS1) 10Chad: pep8: fix up some style warnings in mysql/skrillex.py misc/demux.py [puppet] - 10https://gerrit.wikimedia.org/r/256045 [19:04:32] 6operations, 10ops-eqdfw, 10RESTBase: check for spare disk bays in restbase1007-1009 - https://phabricator.wikimedia.org/T119896#1839350 (10RobH) 3NEW a:3Cmjohnson [19:04:32] 6operations, 10ops-eqdfw, 10RESTBase: check for spare disk bays in restbase1007-1009 - https://phabricator.wikimedia.org/T119896#1839358 (10RobH) [19:04:50] its stable and performs fine :) [19:04:56] Allow me to disagree [19:05:00] 6operations, 6Commons, 10Wikimedia-Media-storage, 5MW-1.27-release-notes, 7Swift: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1839361 (10aaron) a:5aaron>3None [19:05:44] jynus: ? [19:05:56] (03PS1) 10Chad: tcpircbot: fix up pep8 warnings about imports [puppet] - 10https://gerrit.wikimedia.org/r/256047 [19:06:18] jynus: this is that patch that wikidata folks want to deploy right before the fundraiser https://gerrit.wikimedia.org/r/#/c/255063 [19:06:38] wikidata shows basically its age by having weekly db issues that unless it is an unbreak now "we may be looking at it later" [19:06:41] It's enabling stuff that's already enabled elsewhere but that would be going to CentralNotice for the first time now [19:06:48] Sorry I mean, going to Meta [19:06:54] (03CR) 10Ori.livneh: [C: 04-2] "Setting the default encoding is significant to mainspace code in subsequently-imported modules; it has to be done at the top." 
[puppet] - 10https://gerrit.wikimedia.org/r/256047 (owner: 10Chad) [19:07:11] Which is the only wiki running CentralNotice in infrastructure mode and where the db queries run that control banners everywhere [19:07:40] (03CR) 10Smalyshev: "I would prefer having separate port for it, not talking directly to Blazegraph. Even internally, direct connection means access to SPARQL " [puppet] - 10https://gerrit.wikimedia.org/r/256039 (owner: 10Addshore) [19:08:22] yuvipanda: thoughts about https://phabricator.wikimedia.org/T119541 ? [19:08:57] (03PS1) 10Chad: hhvm_tc_space.py: use 'foo not in' instead of 'not foo in' for pep8 [puppet] - 10https://gerrit.wikimedia.org/r/256048 [19:09:20] godog: i wonder if it's a trusty vs jessie thing or somesuch [19:09:26] jynus: the argument above ^ was that if it works fine on other wikis on the same db cluster, it be OK to enable on all the wikis on that db cluster [19:09:31] godog: it clearly is failing on limn1 and on the maps test box [19:09:31] No idea if that's sound [19:09:37] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1839380 (10GWicke) > re: deleting data, when we need to do that how long would it take start to finish? e.g. if we run very tight (or out) of spac... [19:10:06] is there any usage of wikidata on meta? [19:10:16] yuvipanda: could be, I'll poke at it tomorrow a bit too [19:10:51] jynus: no idea... You mean, meta wiki pages that include data from wikidata I guess? [19:11:00] ori: I'll amend to have it suppress the warning instead. [19:11:02] such as... [19:11:12] ostriches: I think that is already the case, no? [19:11:16] we disable it across the repo [19:11:21] the import warning, I mean [19:11:21] I also have no idea why enabling some Lua thing would mean doing anything new with the meta wiki db [19:12:11] (03CR) 10Rush: [C: 031] "seems like this is the new norm :)" [puppet] - 10https://gerrit.wikimedia.org/r/255528 (owner: 10Filippo Giunchedi) [19:12:13] IIRC pep8 has no ignore-this-line pragma [19:12:17] flake8 has '# noqa' [19:12:27] ori: I don't see it in .pep8 files, mostly just E501 about line-too-long. [19:12:38] we should stop using pep8 [19:12:41] for puppet [19:12:53] why? [19:13:00] and use flake8 instead [19:13:05] since everything else uses flake8 [19:13:14] flake8 also respects tox.ini.. [19:13:23] R-E-S-P-E-C-T [19:14:24] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1839384 (10yuvipanda) 5Invalid>3Open Still happening: see limn1 for an example [19:14:38] (03Abandoned) 10Chad: tcpircbot: fix up pep8 warnings about imports [puppet] - 10https://gerrit.wikimedia.org/r/256047 (owner: 10Chad) [19:14:48] (03Abandoned) 10Chad: hhvm_tc_space.py: use 'foo not in' instead of 'not foo in' for pep8 [puppet] - 10https://gerrit.wikimedia.org/r/256048 (owner: 10Chad) [19:14:51] (03Abandoned) 10Chad: pep8: fix up some style warnings in mysql/skrillex.py misc/demux.py [puppet] - 10https://gerrit.wikimedia.org/r/256045 (owner: 10Chad) [19:14:53] (03Abandoned) 10Chad: pep8: fix up webperf python files [puppet] - 10https://gerrit.wikimedia.org/r/255295 (owner: 10Chad) [19:14:56] (03Abandoned) 10Chad: pep8 fixes for elasticsearch_monitoring.py [puppet] - 10https://gerrit.wikimedia.org/r/255288 (owner: 10Chad) [19:14:58] if you run "apt-get update" on any random labs instance, is it broken there too? 
[19:15:19] with a "invalid filename extension" message [19:16:42] jynus: really I don't know much about Wikidata integration... I remember it can grab data for Infoboxes, I guess for any arbitrary wiki element. Don't know how to check how or if it's used on Meta. Is that relevant? [19:19:34] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail [19:19:40] (03PS3) 10Ottomata: Rename timestamp to ts for CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/252432 (https://phabricator.wikimedia.org/T117873) (owner: 10DCausse) [19:20:03] (03CR) 10Ottomata: [C: 032 V: 032] Rename timestamp to ts for CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/252432 (https://phabricator.wikimedia.org/T117873) (owner: 10DCausse) [19:20:38] ostriches: some of those looked OK to me [19:20:53] yuvipanda doesn't like pep8 so I abandoned them :) [19:21:15] (03Restored) 10Ori.livneh: hhvm_tc_space.py: use 'foo not in' instead of 'not foo in' for pep8 [puppet] - 10https://gerrit.wikimedia.org/r/256048 (owner: 10Chad) [19:21:25] (03PS2) 10Ori.livneh: hhvm_tc_space.py: use 'foo not in' instead of 'not foo in' for pep8 [puppet] - 10https://gerrit.wikimedia.org/r/256048 (owner: 10Chad) [19:21:32] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm_tc_space.py: use 'foo not in' instead of 'not foo in' for pep8 [puppet] - 10https://gerrit.wikimedia.org/r/256048 (owner: 10Chad) [19:22:06] ostriches: I was talking about pep8 the tool rather than pep8 the standard :D [19:22:06] (03Restored) 10Ori.livneh: pep8: fix up some style warnings in mysql/skrillex.py misc/demux.py [puppet] - 10https://gerrit.wikimedia.org/r/256045 (owner: 10Chad) [19:22:12] (03PS2) 10Ori.livneh: pep8: fix up some style warnings in mysql/skrillex.py misc/demux.py [puppet] - 10https://gerrit.wikimedia.org/r/256045 (owner: 10Chad) [19:22:20] (03CR) 10Ori.livneh: [C: 032 V: 032] pep8: fix up some style warnings in mysql/skrillex.py misc/demux.py [puppet] - 10https://gerrit.wikimedia.org/r/256045 (owner: 10Chad) [19:22:59] (03Restored) 10Ori.livneh: pep8 fixes for elasticsearch_monitoring.py [puppet] - 10https://gerrit.wikimedia.org/r/255288 (owner: 10Chad) [19:23:00] yuvipanda: I'm going to continue to twist your words and say "yuvipanda likes ugly code that doesn't conform to any standard" [19:23:02] :P [19:23:06] (03PS2) 10Ori.livneh: pep8 fixes for elasticsearch_monitoring.py [puppet] - 10https://gerrit.wikimedia.org/r/255288 (owner: 10Chad) [19:23:12] (03CR) 10Ori.livneh: [C: 032 V: 032] pep8 fixes for elasticsearch_monitoring.py [puppet] - 10https://gerrit.wikimedia.org/r/255288 (owner: 10Chad) [19:23:20] (03PS2) 10Yuvipanda: Toollabs bastion: install GNU automake [puppet] - 10https://gerrit.wikimedia.org/r/255988 (https://phabricator.wikimedia.org/T119870) (owner: 10Zhuyifei1999) [19:23:27] (03CR) 10Yuvipanda: [C: 032 V: 032] Toollabs bastion: install GNU automake [puppet] - 10https://gerrit.wikimedia.org/r/255988 (https://phabricator.wikimedia.org/T119870) (owner: 10Zhuyifei1999) [19:24:49] ori: wattadowith https://gerrit.wikimedia.org/r/#/c/251800/ [19:25:18] (03PS3) 10Addshore: WDQS set 5min timeout for stat1002 access [puppet] - 10https://gerrit.wikimedia.org/r/256039 [19:25:53] yuvipanda: not sure, need to think about it. maybe only set those defaults if $password == false. 
it should be possible to promote a jobrunner slave to a master using CONFIG [19:26:13] also we should maybe complete the migration to redis::instance and do it there instead [19:26:19] yeah [19:26:23] totally [19:26:32] i haven't forgotten about it, i was going to do tin later [19:26:38] (03CR) 10Addshore: "So this simply sets a larger timeout in the header for stat1002 requests now." [puppet] - 10https://gerrit.wikimedia.org/r/256039 (owner: 10Addshore) [19:26:42] ok! [19:26:47] I should move tools redis [19:26:52] but it needs downtime announcement [19:26:53] (03PS1) 10Merlijn van Deen: toollabs: remove motd-tips [puppet] - 10https://gerrit.wikimedia.org/r/256053 (https://phabricator.wikimedia.org/T104327) [19:28:53] (03PS2) 10Yuvipanda: toollabs: remove motd-tips [puppet] - 10https://gerrit.wikimedia.org/r/256053 (https://phabricator.wikimedia.org/T104327) (owner: 10Merlijn van Deen) [19:29:01] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: remove motd-tips [puppet] - 10https://gerrit.wikimedia.org/r/256053 (https://phabricator.wikimedia.org/T104327) (owner: 10Merlijn van Deen) [19:29:04] (03CR) 10Addshore: "Currently trying to access one of the instances from stat1002 doesn't seem to work with the following: curl --verbose "http://wdqs1002.eqi" [puppet] - 10https://gerrit.wikimedia.org/r/256039 (owner: 10Addshore) [19:29:14] (03CR) 10Chad: [C: 032] Remove deprecated wgRateLimitLog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253779 (owner: 10BryanDavis) [19:29:59] (03Merged) 10jenkins-bot: Remove deprecated wgRateLimitLog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253779 (owner: 10BryanDavis) [19:31:39] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: rm deprecated/unused rate limit log config (duration: 00m 28s) [19:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:11] mutante: how do I make a specific check page only a certain contactgroup? [19:37:19] mutante: is that done anywhere else atm? [19:39:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [19:42:14] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1839471 (10chasemp) >>! In T119541#1839384, @yuvipanda wrote: > Still happening: see limn1 for an example any common thread for broken instances (since it doesn't seem to be unive... [19:42:50] yuvipanda: it's done with email but not with paging. it's possible but needs some new notification commands [19:42:53] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures [19:43:19] i can look at it if you want to assign something to me but need some time to stare at it again [19:43:58] yuvipanda: could you confirm if the apt issue is global to labs? [19:45:06] ah, need to afk for a moment, feel free to give me a ticket for that paging thing if you want [19:45:09] mutante: hmm, where else is it being done with email? [19:45:16] team-services [19:45:25] mutante: ok! I'll add you on the ticket and look at other stuff (re: apt) [19:45:39] (03PS2) 10Ori.livneh: Clean up l10nupdate settings (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/256025 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [19:45:46] (03CR) 10Ori.livneh: [C: 032 V: 032] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/256025 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [19:45:56] yuvipanda: thanks, it might be that all package upgrades are broken on all instances [19:46:44] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:47:06] !log running `nodetool decommission` on restbase1009 in preparation for the conversion to the multi-instance setup, per https://phabricator.wikimedia.org/T95253# [19:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:50:34] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [19:59:37] ori: i'm not sure thei systemd template thing is better [19:59:46] (03PS2) 10BryanDavis: Clean up l10nupdate settings (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) [19:59:52] its kinda nicer, because then there will be fewer duplication of text on the server [20:00:01] but, it makes the puppetization less nice [20:00:12] and i could just use puppet templates to do the same thing [20:00:13] (03CR) 10jenkins-bot: [V: 04-1] Clean up l10nupdate settings (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [20:00:20] only difference would be not using @ to address the instance [20:00:37] unless there is some way that systemd groups the instances together so that they can all be shut down with a single command... [20:00:56] (03PS3) 10BryanDavis: Clean up l10nupdate settings (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) [20:01:03] doesn't look like it thouhg [20:01:05] (03CR) 10BryanDavis: Clean up l10nupdate settings (2/2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [20:05:44] !log re-enabled puppet on neodymium, minion testing concluded for now [20:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:04] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:12:04] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [20:13:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [20:13:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [20:15:53] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [20:21:14] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [20:23:34] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [20:23:43] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:24:54] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [20:26:49] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on 
commons - https://phabricator.wikimedia.org/T119038#1839536 (10Bawolff) For 388px size of https://commons.wikimedia.org/wiki/File:Jennifer_Winget_at_the_launch_of_Watch_Time's_ma... [20:28:05] (03PS4) 10Addshore: Set WDQS 5min expiry for internal access Port:8888 [puppet] - 10https://gerrit.wikimedia.org/r/256039 [20:31:24] mutante: We've got some left over uid 997 stuff on tin from the l10nupdate renumbering -- https://phabricator.wikimedia.org/T119746#1839042 [20:33:30] bd808: Is there a a lot of files? [20:33:41] (03CR) 10Ottomata: "Cool, LGTM. Hold off for joal's go ahead." [puppet] - 10https://gerrit.wikimedia.org/r/256020 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [20:34:12] (03CR) 10Smalyshev: [C: 031] Set WDQS 5min expiry for internal access Port:8888 [puppet] - 10https://gerrit.wikimedia.org/r/256039 (owner: 10Addshore) [20:34:35] (03CR) 10Ottomata: [C: 031] "Oh, woops this is the statsv one. Go ahead!" [puppet] - 10https://gerrit.wikimedia.org/r/256020 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [20:38:59] (03PS3) 10BBlack: statsv: switch "ip" field to X-Client-IP like webrequest [puppet] - 10https://gerrit.wikimedia.org/r/256020 (https://phabricator.wikimedia.org/T118557) [20:39:19] (03PS1) 10Smalyshev: Add served-by header [puppet] - 10https://gerrit.wikimedia.org/r/256111 (https://phabricator.wikimedia.org/T119508) [20:39:21] (03CR) 10BBlack: [C: 032 V: 032] statsv: switch "ip" field to X-Client-IP like webrequest [puppet] - 10https://gerrit.wikimedia.org/r/256020 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [20:39:53] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1839562 (10Bawolff) All this sounds like there is just exceptionally high packet loss on 239.128.0.113 relative to 239.128.0.1... [20:40:10] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1839563 (10GWicke) Per IRC conversation with @fgiunchedi and @eevans I kicked off the decommission on restbase1009. It is streaming to 1005 and 10... 
[20:40:15] (03PS2) 10Smalyshev: WDQS add X-Served-By header to response [puppet] - 10https://gerrit.wikimedia.org/r/256043 (https://phabricator.wikimedia.org/T119508) (owner: 10Addshore) [20:40:37] (03PS2) 10Smalyshev: Add served-by header [puppet] - 10https://gerrit.wikimedia.org/r/256111 (https://phabricator.wikimedia.org/T119508) [20:41:50] (03CR) 10Addshore: [C: 031] Add served-by header [puppet] - 10https://gerrit.wikimedia.org/r/256111 (https://phabricator.wikimedia.org/T119508) (owner: 10Smalyshev) [20:42:12] (03Abandoned) 10Addshore: WDQS add X-Served-By header to response [puppet] - 10https://gerrit.wikimedia.org/r/256043 (https://phabricator.wikimedia.org/T119508) (owner: 10Addshore) [20:43:01] (03PS3) 10Smalyshev: Add served-by header [puppet] - 10https://gerrit.wikimedia.org/r/256111 (https://phabricator.wikimedia.org/T119508) [20:43:44] PROBLEM - git_daemon_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/lib/git-core/git-daemon [20:43:45] (03CR) 10Addshore: [C: 031] Add served-by header [puppet] - 10https://gerrit.wikimedia.org/r/256111 (https://phabricator.wikimedia.org/T119508) (owner: 10Smalyshev) [20:45:43] RECOVERY - git_daemon_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [20:48:00] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1839575 (10BBlack) Well anything's possible, but AFAIK there's no special treatment of 239.128.0.113 vs .112 in terms of multi... [20:51:47] (03CR) 10BBlack: "Are we still holding this on esams link upgrades?" [dns] - 10https://gerrit.wikimedia.org/r/239072 (owner: 10Faidon Liambotis) [20:54:04] (03PS2) 10Isart: Adding diamond collector to send P_S metrics to graphite Fixing user/pass on template [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256007 [20:57:04] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:00:29] ^ that was me, fixed [21:01:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:19:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [21:31:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [21:32:58] bd808: i'm checking that right now, i'm surprised because i ran find with / [21:34:38] maybe that (system)user ran git clone and was still logged in, in the same session [21:35:29] (03PS1) 10Reedy: Remove pear php-mail related packages [puppet] - 10https://gerrit.wikimedia.org/r/256119 [21:35:41] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1839667 (10Jdrewniak) I set up my ssh config file according to https://wikitech.wikimedia.org/wiki/Analytics/... [21:36:07] yea, confirmed those files in /var/lib/10nupdate/mediawiki/ owned by the old UID, i'm positive though these were not there when i ran find multiple times before ... 
[21:36:39] mutante: yuvipanda fixed some objects for me over the weekend too [21:36:59] (03CR) 10Reedy: [C: 04-1] "Damn it" [puppet] - 10https://gerrit.wikimedia.org/r/256119 (owner: 10Reedy) [21:37:06] Reedy: then there must be something that re-breaks them [21:37:20] i ran find on the entire / and they were all gone ..hrmm [21:38:45] (03PS2) 10Reedy: Remove pear php-mail related packages [puppet] - 10https://gerrit.wikimedia.org/r/256119 [21:38:53] unless cron was running something as the wrong user... [21:39:01] could there be something related to the cron job that makes it hang on to the old id? [21:39:36] snap :) [21:39:45] yea, cron sounds like something ..:p [21:40:17] all the files here were in ./core/.git/objects/ [21:40:27] which I think is what Yuvi fixed for me [21:40:29] (03PS1) 10Catrope: Set $wgEchoSharedTrackingDb on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256120 (https://phabricator.wikimedia.org/T119522) [21:40:42] I guess we should let it run again and see if it makes a mess :) [21:40:48] (03PS1) 10ArielGlenn: give jgirault and jdrewniak bastion access [puppet] - 10https://gerrit.wikimedia.org/r/256121 (https://phabricator.wikimedia.org/T118998) [21:40:52] i fixed the file permissions [21:40:55] 03:01 YuviPanda: run chown -R l10nupdate: /var/lib/l10nupdate/mediawiki for Reedy on tin [21:40:55] yes, run it again [21:41:07] root@tin:/var/lib/l10nupdate/mediawiki# find . -uid 997 -exec chown l10nupdate {} \; [21:41:40] would forcing cron to reload or something help? [21:42:16] (03PS2) 10ArielGlenn: give jgirault and jdrewniak bastion access [puppet] - 10https://gerrit.wikimedia.org/r/256121 (https://phabricator.wikimedia.org/T118998) [21:42:24] so is a cron job running git pull? [21:42:34] looks [21:43:13] yeah, eventually [21:43:17] (03CR) 10ArielGlenn: [C: 032] give jgirault and jdrewniak bastion access [puppet] - 10https://gerrit.wikimedia.org/r/256121 (https://phabricator.wikimedia.org/T118998) (owner: 10ArielGlenn) [21:43:21] l10nupdate-1 [21:45:52] It should be running https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/l10nupdate-1 [21:46:06] https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/l10nupdate is for other people to run it [21:46:29] l10nupdates cron... [21:46:29] 0 2 * * * /usr/local/bin/l10nupdate-1 --verbose >> /var/log/l10nupdatelog/l10nupdate.log 2>&1 [21:47:10] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1839703 (10ArielGlenn) Good catch. Your accounts are now also both live on bast1001.wi... [21:50:20] (03CR) 10Reedy: Remove pear php-mail related packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256119 (owner: 10Reedy) [21:50:49] Reedy: bd808: [21:50:56] root@tin:/# find / -uid 997 [21:50:56] find: `/proc/3500/task/3500/fd/5': No such file or directory [21:50:56] find: `/proc/3500/task/3500/fdinfo/5': No such file or directory [21:51:03] I suspect the crontab is still running it as the old user [21:51:08] those processes above... 
[21:51:12] must be the cron [21:51:54] rewrites crontab for that user [21:52:37] restarts cron service [21:52:44] :) [21:54:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [21:54:43] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [21:55:52] (03CR) 10Mattflaschen: [C: 032] Set $wgEchoSharedTrackingDb on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256120 (https://phabricator.wikimedia.org/T119522) (owner: 10Catrope) [21:55:59] !log re-wrote l10nupdate cron; restarted cron service on tin [21:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:56:10] hrmm.. does that really do it... [21:56:14] (03Merged) 10jenkins-bot: Set $wgEchoSharedTrackingDb on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256120 (https://phabricator.wikimedia.org/T119522) (owner: 10Catrope) [21:56:30] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1839753 (10Jdrewniak) thanks! I 'm able to login to stat1002 now :) [21:56:34] has it killed the tasks? [21:56:59] I presume you can't edit the crontab for the uid? [21:57:02] every time i search there's a new PID [21:57:44] i opened it with crontab -e for the user name [21:57:58] and saved it again [21:58:16] and it said specifically how it was writing it, not "no changes" [21:58:27] make a comment edit or something? [21:58:46] yes, it's also puppetized [21:59:03] commenting it, running puppet [22:00:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [22:00:34] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [22:04:55] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1839764 (10Bawolff) > > The other general issue we face (both in historically and now) is the cache layering race: these ima... [22:05:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1839765 (10ArielGlenn) Can you nudge jgirault to check too (I dunno if you are in the s... [22:05:53] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: puppet fail [22:14:44] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:14:44] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:16:19] Reedy: i'm at the point where ##linux finds the issue "interesting" :p [22:16:42] you mean, change uid, but they seem to still have processes/possible crontab?
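One way to chase the "cron is still running it as the old user" theory above, sketched under the assumption that the stale uid is 997 and tin runs a Debian-style cron (the actual fix logged at 21:55 was re-saving the crontab and restarting the cron service):
    # anything still executing under the old numeric uid?
    ps -eo uid,user,pid,etime,cmd | awk '$1 == 997'
    # who owns the spool file, and what the job looks like
    ls -ln /var/spool/cron/crontabs/l10nupdate
    crontab -u l10nupdate -l
    # re-install the crontab from its own contents and bounce crond so it re-resolves the account
    crontab -u l10nupdate -l | crontab -u l10nupdate -
    service cron restart
Whether crond was genuinely holding on to the old uid was never pinned down in the log, but the rewrite-plus-restart was enough for new jobs to come up under the new uid, as the 22:47 test below confirms.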
[22:18:37] mutante: just reboot tin [22:18:37] :D [22:24:20] Reedy: i actually did that with mira :p [22:24:24] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1839829 (10Bawolff) I received a user complaint about https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/PL_Karol_May_-... [22:38:23] (03PS1) 10Merlijn van Deen: package_builder: clarify how to download a package [puppet] - 10https://gerrit.wikimedia.org/r/256125 [22:40:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858861 MB (50% inode=99%) [22:45:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858861 MB (50% inode=99%) [22:45:14] PROBLEM - puppet last run on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:47:38] Reedy: bd808: soo.. yea.. to confirm the cron service restart fixed it, i temp. stopped puppet, added a new cronjob in the same crontab [22:47:49] yay [22:47:51] and let that write to a file in /tmp and it's owned by the new UID [22:47:55] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1839876 (10Tgr) Sorry for being unresponsive. How do I access these packages? [22:48:01] nice [22:49:44] PROBLEM - RAID on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:49:51] those lines i pasted from find, that was a red herring [22:50:11] PROBLEM - MariaDB Slave Lag: s7 on db1034 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 327 [22:50:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858861 MB (50% inode=99%) [22:50:37] the one for db1034 paged [22:51:01] <_joe_> whoa [22:51:12] <_joe_> what's up with that db? [22:51:43] RECOVERY - RAID on db1034 is OK: OK: optimal, 1 logical, 2 physical [22:52:42] s7 watchlist etc [22:53:01] <_joe_> Reedy: care to explain? [22:53:15] I was just saying which db slave it was [22:54:01] RECOVERY - MariaDB Slave Lag: s7 on db1034 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [22:54:38] <_joe_> did we just perform some schema change there? [22:54:56] It's not impossible, but I don't know if any are ongoing [22:55:12] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 15User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1839882 (10Dzahn) >>! In T119746#1839042, @bd808 wrote: > There are still some files u...
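The 22:47 verification above can be sketched roughly as follows; the marker filename and the puppet-disable message are invented for illustration:
    # keep puppet from rewriting the crontab while testing
    puppet agent --disable "verifying l10nupdate cron runs under the new uid"
    # append a one-shot test entry that drops a marker file
    ( crontab -u l10nupdate -l; echo '* * * * * touch /tmp/l10nupdate-uid-test' ) | crontab -u l10nupdate -
    # a minute later the marker should carry the new numeric uid, not 997
    ls -ln /tmp/l10nupdate-uid-test
    # remove the test entry and let puppet manage the crontab again
    crontab -u l10nupdate -l | grep -v 'l10nupdate-uid-test' | crontab -u l10nupdate -
    puppet agent --enable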
[22:55:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858857 MB (50% inode=99%) [22:58:09] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1839888 (10Ottomata) @gwicke and I discussed the schema/revision in meta issue in IRC today. He had an idea that I quite like! @gwicke suggested t... [23:00:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858793 MB (50% inode=99%) [23:05:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858793 MB (50% inode=99%) [23:05:44] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858793 MB (50% inode=99%) [23:13:44] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 7Monitoring: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#1839912 (10Smalyshev) 3NEW [23:15:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858793 MB (50% inode=99%) [23:20:05] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858792 MB (50% inode=99%) [23:24:38] mutante: wait when did you get paged? today? [23:25:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858792 MB (50% inode=99%) [23:25:26] apergos: yes, about once per day [23:25:36] try this [23:25:42] I still got no pages [23:25:44] /last db1034 (just one was paging) [23:26:16] <_joe_> apergos: you're out of your paging hours, I guess [23:26:18] <_joe_> I am now [23:26:24] no, I'm in 24hr [23:26:25] yea, probably a timezone thing? [23:26:28] unless rob changed it [23:26:28] oh [23:26:33] PROBLEM - puppet last run on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:39] :p ... [23:26:58] <_joe_> again? 
[23:27:03] <_joe_> shit [23:27:05] same host [23:27:09] but different check [23:27:14] yea [23:27:22] robh: paging issue remains for me it seems [23:28:00] <_joe_> it's happening again [23:28:12] <_joe_> I can't even ssh into db1034 ofc [23:30:05] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858792 MB (50% inode=99%) [23:30:35] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Set up backend per-IP limits on varnish for WDQS - https://phabricator.wikimedia.org/T119917#1839938 (10Smalyshev) 3NEW [23:31:00] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Set up backend per-IP limits on varnish for WDQS - https://phabricator.wikimedia.org/T119917#1839945 (10Smalyshev) [23:32:33] apergos: not getting pages? [23:32:40] no [23:33:05] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:33:09] well im just logging into the portal to see if you got the page in there [23:33:35] ok [23:34:25] interesting [23:34:29] so everyone else shows delivered [23:34:31] but yours show sent [23:34:36] grrrr [23:35:03] i'll put in for a new support ticket to ask whats up [23:35:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858792 MB (50% inode=99%) [23:36:06] apergos: do you know when you last got a page? [23:36:31] cuz at first i see sent, then when i hit 2015-11-27 i show rejected [23:36:39] * robh is working backwards in the report [23:36:51] yeah I know I didn't get any during the outage last week [23:37:12] but I don't keep them forever so I don't know when I last got one [23:37:34] I mean usually I delete several of them and then wait for them to queue up again [23:37:44] but not always all of them... so... [23:38:05] well i put in for the report to generate and email me for all messages sent to you [23:38:10] so we shall see [23:38:19] I'm going to open a task in the private vendor space as well since it'll include your cell [23:38:54] meh... or nda, i dunno yet [23:40:03] I'm going to do something simple and restart my phone [23:40:11] just in case [23:40:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858792 MB (50% inode=99%) [23:40:28] apergos: lemme know when its restarted [23:40:32] cuz we can fire off a test from portal [23:41:36] ok [23:41:39] do it [23:42:10] robh: [23:42:18] darn non functioning tab complete with your name [23:42:28] received wtf [23:42:31] sent [23:42:45] sh*t I'm getting a pile of them now [23:42:50] what on earth [23:42:52] ha, it was your phone ;] [23:43:01] well it would have been nice to know [23:43:07] mine gets shitty too, things i never had happen in ios [23:43:07] no indicator that anything was wrong at all [23:43:10] that happen in droid. [23:43:19] i have to reboot my moto x about once every 3 weeks or so [23:43:20] 11 so far...
[23:43:28] oh you are going to get a shit ton [23:43:30] maybe I should just reboot every couple weeks [23:43:32] there were dozens [23:43:48] i imagine all the rejected were old enough to reject out (they are much older, 27th or older) [23:43:56] and then the 'sent' will start coming back as delivered in a bit [23:44:11] so far just 11 [23:44:17] the last one is a test (you I guess) [23:44:20] yep now shows delivered on the one i just sent [23:44:27] rather than just sent, they are all triggering to delivered now [23:44:33] problem solved \o/ [23:44:34] well better plug the phone back in :-D [23:44:52] solved windoze style [23:44:57] i reboot mine when i get paid, sad but true [23:45:00] "did you try restarting it?" [23:45:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858792 MB (50% inode=99%) [23:45:55] thanks for checking and for the test [23:47:21] welcome =] [23:50:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858792 MB (50% inode=99%) [23:51:18] (03Abandoned) 10Dzahn: mediawiki: specify uid 10002 for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [23:55:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858788 MB (50% inode=99%) [23:58:00] (03CR) 10Bartosz Dziewoński: [C: 031] add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn)