[00:08:32] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 10.65.0.1 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [00:09:33] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 10.65.0.1, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 [00:13:02] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 10.65.0.1 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [00:14:02] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 10.65.0.1, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 [00:51:42] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 311 seconds [00:52:01] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 328 seconds [00:54:01] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:54:12] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [01:57:59] (03CR) 10Tim Landscheidt: typo in dbstore2001/2002 entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/182363 (owner: 10RobH) [03:29:07] !log clone and deploy es2002 es2003 es2004 [03:29:12] Logged the message, Master [03:31:15] (03PS1) 10Springle: deploy es2002 es2003 es2004 [puppet] - 10https://gerrit.wikimedia.org/r/182429 [03:33:03] (03CR) 10Springle: [C: 032] deploy es2002 es2003 es2004 [puppet] - 10https://gerrit.wikimedia.org/r/182429 (owner: 10Springle) [03:43:31] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:58:39] (03PS1) 10Springle: repool db1061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182430 [04:52:38] (03PS5) 10KartikMistry: Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 [04:53:17] springle: I think this ^^ is OK now. 
Added core devs as reviewers as you suggested :) [04:58:38] :) [04:58:47] (03CR) 10Springle: [C: 032] repool db1061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182430 (owner: 10Springle) [04:59:49] !log springle Synchronized wmf-config/db-eqiad.php: repool db1061, warm up (duration: 00m 06s) [04:59:54] Logged the message, Master [06:20:46] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:05] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:15] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:46] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 3 failures [06:28:56] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 3 failures [06:30:05] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:56] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:46] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:45:49] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:30] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:06:09] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail [07:23:59] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [07:30:28] 3operations: Dear ops-requests@rt.wikimedia.org, No Publication Fee for AASCIT Members - https://phabricator.wikimedia.org/T85687#952281 (10emailbot) [08:24:14] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [09:34:05] hmm, cmjohnson1 is on duty? I cannot see them [09:34:21] for whomever is on duty, not sure whether LABS warnings are relevant ... [09:34:21] shinken-wm> PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%) [09:58:27] (03CR) 10Steinsplitter: "@Hoo man/Reedy: If this patch is OK, can you please merge it? Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180560 (owner: 10Dereckson) [10:04:48] 3operations: mail spam from diamond "cannot resolve host" - https://phabricator.wikimedia.org/T85691#952332 (10fgiunchedi) 3NEW a:3fgiunchedi [10:13:21] 3operations: mail spam from diamond "cannot resolve host" - https://phabricator.wikimedia.org/T85691#952343 (10yuvipanda) One of the many side effects of T72076 [10:13:35] godog: ^ I get about 100 puppet failure emails because of that [10:14:53] YuviPanda: of T72076 ? [10:15:13] godog: yeah. 
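For reference on the "Router interfaces on mr1-eqiad" checks near the top of this log: OID 1.3.6.1.2.1.2.2.1.8 is IF-MIB::ifOperStatus, the per-interface operational-status column, and the OK output ("interfaces up: 33, down: 0, dormant: 0, ...") is simply a tally of that column; the flapping CRITICALs were SNMP "no response" timeouts rather than interfaces actually going down. Below is a minimal conceptual sketch of that kind of poll using net-snmp's snmpwalk from Python; the host, community string and parsing are illustrative, and this is not the actual Icinga plugin.

    # Conceptual sketch of the "Router interfaces" check: walk
    # IF-MIB::ifOperStatus (1.3.6.1.2.1.2.2.1.8) and tally interface states.
    # Not the real Icinga plugin; community string and host are placeholders.
    import subprocess
    from collections import Counter

    OID_IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"
    STATE_NAMES = {1: "up", 2: "down", 3: "testing", 4: "unknown",
                   5: "dormant", 6: "notPresent", 7: "lowerLayerDown"}

    def interface_state_counts(host, community="public"):
        # net-snmp's snmpwalk; -Oqv prints only the values, one per line
        out = subprocess.check_output(
            ["snmpwalk", "-v2c", "-c", community, "-Oqv",
             host, OID_IF_OPER_STATUS],
            universal_newlines=True)
        counts = Counter()
        for token in out.split():
            digits = "".join(ch for ch in token if ch.isdigit())
            if digits:                                # numeric form, e.g. "1" or "up(1)"
                counts[STATE_NAMES.get(int(digits), "unknown")] += 1
            elif token in STATE_NAMES.values():       # symbolic form, e.g. "up"
                counts[token] += 1
        return counts

    # interface_state_counts("10.65.0.1") would ideally give Counter({"up": 33})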
[10:17:27] 3operations: mail spam from diamond "cannot resolve host" - https://phabricator.wikimedia.org/T85691#952347 (10fgiunchedi) [10:17:36] YuviPanda: good catch, I've merged mine into that [10:18:46] :) [10:21:46] * YuviPanda curses every. single. telecom. operator. in. the. entire. world. [11:48:00] !log reboot es2004, debugging gmond stuck on start/stop [11:48:03] Logged the message, Master [12:37:22] (03PS1) 10Faidon Liambotis: ganglia_new: set gmond's daemonize to "yes" [puppet] - 10https://gerrit.wikimedia.org/r/182451 [12:42:41] (03PS15) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:42:48] valhallasw`cloud: heh, this patchset now looks almost like what you were suggesting yesterday :P [12:43:00] :> [12:43:16] yes, looks much better :-) [12:43:47] YuviPanda: ask hashar to set up flake8 while you're at it :-p [12:43:50] valhallasw`cloud: :D Yeah, once I realized the decorator is going to be uselessish. [12:44:31] valhallasw`cloud: oh, that. [12:45:49] (03PS16) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:45:54] valhallasw`cloud: is functionally complete now, I think. want to CR / +1? [12:46:00] YuviPanda: mmm [12:46:07] YuviPanda: first other stuff to do, sorry [12:46:14] valhallasw`cloud: :) ’tis ok! [12:51:54] (03PS3) 10Yuvipanda: Fix issues with whitelisted & greylisted yaml files [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182102 [12:51:56] (03PS17) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:51:58] (03PS1) 10Yuvipanda: Add .pyc to gitignore [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182452 [12:52:21] (03CR) 10Yuvipanda: [C: 032 V: 032] Add .pyc to gitignore [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182452 (owner: 10Yuvipanda) [13:21:35] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3978.399902 [13:27:45] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:55:32] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [14:05:30] (03CR) 10Filippo Giunchedi: [C: 031] ganglia_new: set gmond's daemonize to "yes" [puppet] - 10https://gerrit.wikimedia.org/r/182451 (owner: 10Faidon Liambotis) [14:07:38] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [14:17:32] (03CR) 10Faidon Liambotis: [C: 032] ganglia_new: set gmond's daemonize to "yes" [puppet] - 10https://gerrit.wikimedia.org/r/182451 (owner: 10Faidon Liambotis) [14:18:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
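For context on the ganglia_new change merged above (Gerrit 182451, "set gmond's daemonize to yes"), presumably related to the earlier "gmond stuck on start/stop" debugging on es2004: gmond takes that switch from the globals section of gmond.conf. A sketch of the relevant stanza follows; the surrounding values are illustrative rather than the actual output of the ganglia_new puppet templates.

    /* Illustrative gmond.conf stanza, not the rendered puppet template. */
    globals {
      daemonize = yes     /* the setting changed by Gerrit 182451 */
      setuid = yes
      user = ganglia
      mute = no
      deaf = no
      debug_level = 0
    }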
[14:38:35] (03CR) 10Ottomata: [C: 032 V: 032] Use generic snappy version and fix CLASSPATH [debs/kafka] - 10https://gerrit.wikimedia.org/r/178361 (owner: 10Plucas) [14:48:32] ACKNOWLEDGEMENT - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi pending disk swap [14:54:11] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail [14:54:51] PROBLEM - puppet last run on heze is CRITICAL: CRITICAL: puppet fail [14:56:01] PROBLEM - puppet last run on capella is CRITICAL: CRITICAL: puppet fail [15:00:52] RECOVERY - RAID on ms-be2003 is OK: OK: optimal, 13 logical, 13 physical [15:02:22] PROBLEM - puppet last run on haedus is CRITICAL: CRITICAL: puppet fail [15:30:28] (03CR) 10Merlijn van Deen: [C: 04-1] "Looks good overall, I just have a gazillion nitpicky things ;-)" (0324 comments) [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 (owner: 10Yuvipanda) [15:32:36] YuviPanda: ^ :-D [15:32:55] valhallasw`cloud: :D yeah, responding / fixing. re: License, there’s a LICENSE file in rep [15:32:56] o [15:33:11] YuviPanda: legal sez it should be in every file [15:33:45] YuviPanda: author, project and short license info [15:34:01] lemmefindit [15:34:11] https://wikimania2014.wikimedia.org/wiki/Submissions/Open_Source_Hygiene:_Getting_the_Details_Right [15:34:36] http://lu.is/blog/wp-content/uploads/2014/09/Open-Source-Hygiene-1.pdf#viewer.action=download [15:35:17] slide 27 is the suggested method [15:35:31] valhallasw`cloud: awww, damn :D [15:35:31] ok [15:35:41] YuviPanda: it's for the good of the commons! [15:35:53] fine, fine :) [15:36:47] and you'll get a virtual hug from me if you do it! ;-D [15:37:43] valhallasw`cloud: I shall :) going through things now. [15:37:56] valhallasw`cloud: also, reason it is tablename: list [15:38:01] is because we primarily care about tables. [15:38:07] and usually, if you look at the report [15:38:11] some tables are in like a few thousand dbs [15:38:15] and they are the same table [15:38:18] and we want to either expose them [15:38:20] or kill them [15:38:24] ahh yeah [15:38:25] makes sense [15:38:27] per-db we don’t raelly care [15:38:33] everything is at the table level [15:38:44] policy is on the table level, yeah [15:38:46] makes sense [15:41:53] (03CR) 10Yuvipanda: Refactor to not be a big ball of mud (037 comments) [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 (owner: 10Yuvipanda) [15:42:02] valhallasw`cloud: :) I’m fixing the other things [15:42:42] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:43:32] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:44:52] RECOVERY - puppet last run on capella is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:48:47] YuviPanda: whitelisted=True: plz document that behavior [15:49:03] same for the condition= being SQL format [15:49:07] valhallasw`cloud: yeah, I shall. am wondering, if I should use asserts. 
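For reference on the per-file license header discussion above (15:33 onward, "author, project and short license info" in every file rather than only a repository-level LICENSE): a generic Python sketch of that pattern is below. The exact wording of slide 27 is not reproduced here, and the placeholder values are not what was actually committed to labsdb-auditor.

    # <project name>: <one-line description of the project>
    #
    # Copyright (c) <year> <copyright holder>
    #
    # This file is part of <project name> and is released under the
    # <license name> license; see the top-level LICENSE file in this
    # repository for the full license text.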
[15:49:10] (maybe add an example) [15:49:23] consistency checks are good [15:49:27] oh yeah, documentation isn’t the best [15:50:14] YuviPanda: at least put the my.cnf location in config, then [15:50:22] .my.cnf can also specify host, btw [15:50:29] true [15:50:33] oh but there are multiple hosts [15:50:34] nm [15:50:51] valhallasw`cloud: I should write what exactly is a LabsDB [15:50:57] valhallasw`cloud: also how exactly our replication architecture workos [15:50:58] yesplz <3 [15:50:58] *works [15:51:02] RECOVERY - puppet last run on haedus is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:51:04] there’s no documentation about it at all [15:51:05] anywhere [15:51:10] other than in 3-4 people’s heads [16:44:16] hello all :) I accidently removed the Google authenticator account for the 2FA of phabricator. Is here someone who can help recover my account (removing the 2FA)? [16:46:11] (see https://wikitech.wikimedia.org/wiki/Phabricator ) [17:17:56] (03PS18) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [17:18:02] valhallasw`cloud: added per-file headers :) [17:18:11] * valhallasw`cloud hugs YuviPanda [17:22:46] (03PS19) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [17:23:50] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:24:39] YuviPanda: hi :) Short question: have you shell access for phabricator host? :) [17:26:33] FlorianSW: heya! I would, yeah. why? [17:26:53] i have a little problem, see my message above: [17:26:55] hello all :) I accidently removed the Google authenticator account for the 2FA of phabricator. Is here someone who can help recover my account (removing the 2FA)? [17:27:02] [17:46] (see https://wikitech.wikimedia.org/wiki/Phabricator ) [17:27:11] and maybe you can help me :) [17:27:17] ah [17:27:21] > Please note that removal of 2FA is a serious request, and all too easily socially engineered. All requests of this nature should be treated with the same degree of security and confirmation as ssh key changes. [17:27:42] FlorianSW: I’m not sure how I can verify that you *are* the person who is FlorianSW... [17:27:49] and I’m also not sure what the policy for these things is :| [17:27:55] andre__: ^ [17:28:29] hmm, yeah, that's correct [17:29:32] YuviPanda, Yeah. I normally ask people to make a certain change on their user page on mw.org from their account, with a certain summary. [17:29:41] Haven't come up with a better / crazier workaround yet [17:30:09] FlorianSW: where are you from? [17:30:28] T13|mobile: Germany [17:30:37] andre__: ah, hmm. so I would have to connect FlorianSW’s phab username to their mw.org user account and verify an edit... [17:30:49] andre__, YuviPanda: that would be no problem [17:31:00] YuviPanda: it's already connected [17:31:01] https://phabricator.wikimedia.org/p/Florian/ is already connected [17:31:05] aha! [17:31:06] cool [17:32:27] FlorianSW: see pm [17:32:36] YuviPanda: see pm :P [17:36:11] FlorianSW: yw :) [17:36:18] :D [17:36:20] andre__: mind if I add this to the phabricator page on wikitech? 
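On the .my.cnf exchange above (around 15:50): a MySQL option file can carry the connection host as well as the credentials, which is why it is enough on its own for single-host tools; since the auditor talks to several labsdb hosts, only the credentials would come from the option file and the hosts stay in the tool's own config. A generic example of the format, with placeholder values:

    # ~/.my.cnf (or whatever path the auditor's config points at)
    [client]
    user     = exampleuser
    password = examplepassword
    # host could also be set here, but is left out because the auditor
    # connects to multiple hosts:
    # host = dbhost.example.org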
[17:36:34] andre__: We maybe should document a proper way for recovering these things :/ [17:36:43] damn, YuviPanda is fast, really fast :D [17:36:53] FlorianSW: we can add what we just did to the wikitech page [17:37:03] ah, ok :) [17:37:07] let me do that now [17:37:28] but maybe there should be a link on the help page, too. Some of the users maybe doesn't look at wikitech [17:37:34] YuviPanda, go ahead please [17:37:34] that is definitely [17:37:34] true [17:37:39] (doc'ing it) [17:37:48] andre__: yeah, I am adding it on the wikitech page. [17:38:40] thanks YuviPanda and andre__ [17:38:46] I didn't do much :) [17:39:01] Krenair: andre__ hmm, even if they didn’t have their MW account connected, they *could* make an edit on wikitech [17:39:08] since that’s the same LDAP account as phab [17:39:31] which pretty much defeats the point of mfa :p [17:39:33] but yes [17:39:37] :P [17:39:38] yes [17:39:47] hmm [17:39:57] I’m now feeling uncomfortable about this, ish. [17:40:07] other idea: using google hangouts to check IDs, etc. [17:40:28] true, but then who is going to do the check? plus outing, etc [17:40:45] ....or using snailmail to send a scan of your passport! [17:40:56] LCA does this kind of thing all the time don't they? [17:42:58] yeah, but I dunno. FlorianSW is probably going to be pissed if we ask for a passport copy :) [17:44:26] YuviPanda: no comment :D [17:44:34] :P [17:44:59] no, it's really difficult. I don't know, but: how Google do it? [17:45:12] you can send a code to your smartphone via sms [17:45:21] but what, if you don't added your numberß [17:45:26] i never did that :P [17:45:59] but i think that a passport copy (if deleted after the recovering) is a good way, but a question: who says, that my real name is Florian ;) [17:47:02] "degree of security and confirmation as ssh key changes" how this is handeld? [17:47:12] YuviPanda, Krenair, andre__: ^ [17:47:32] FlorianSW: ah, I don’t know :D [17:47:36] great :D [17:48:17] i haven't found a page on wikitech about it [17:48:28] FlorianSW: usually, you are asked to either upload an ssh key via gerrit (proving you have LDAP creds) [17:48:37] FlorianSW: or put it on a wikipage in officewiki (staff) or wikitech [17:48:52] FlorianSW: https://phabricator.wikimedia.org/T85706?workflow=create [17:49:18] https://wikitech.wikimedia.org/wiki/Password_reset/Confirming_identities ? [17:49:45] i think the way over wikitech (instead of mwwiki) is a good way for now (which still moves the whole 2FA ad absurdum, but ok) [17:50:05] FlorianSW: wikitech also has 2FA :) [17:50:42] YuviPanda: yes, but i still can enable it on phab but leave disabled it on wikitech :D [17:50:50] :D [17:50:58] I’ve it enabled on wikitech (because cloudadmin) but not on phab [17:51:14] where you have WMF-NDA? [17:51:15] but I can verify myself and reset my 2FA token myself if I want, provided I’m in posession of my prod SSH key... [17:51:25] Krenair: oh, hmm. good point. [17:51:35] hmmm :) [17:51:39] In case someone feels a physical need to keep discussing this topic ad nauseam, there's also https://meta.wikimedia.org/wiki/Wikimedia_Forum#Wikimedia_should_have_a_contingency_plan_for_a_total_sitewide_password_.22reboot.22 [17:57:04] maybe sometimes T13|mobile can say us, if he had a good idea when asking me, where i'm from... if he comes back online... sometimes [17:59:18] This is silly. [17:59:29] * marktraceur is reading the passwords thing [17:59:48] Why would there ever be plaintext passwords? We aren't idiots. [18:00:56] marktraceur: loginviahttp! 
[18:01:09] I don't want to know. [18:02:29] Obviously the answer is to use PGP for logins everywhere. [18:03:35] marktraceur: there are plaintext passwords, stored in PrivateSettings.php ;) [18:03:53] True! But not for user accounts. [18:13:19] PROBLEM - Varnishkafka Delivery Errors on cp3017 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 394.108337 [18:14:00] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 607.68335 [18:16:29] RECOVERY - Varnishkafka Delivery Errors on cp3017 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:16:39] PROBLEM - Varnishkafka Delivery Errors on cp3007 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 844.450012 [18:17:09] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:17:39] PROBLEM - Varnishkafka Delivery Errors on cp3016 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 738.81665 [18:19:35] hmmm [18:19:39] qchris: ^ [18:20:24] Mhmm. [18:20:39] PROBLEM - Varnishkafka Delivery Errors on cp3004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 404.991669 [18:20:47] no 3022 [18:20:59] PROBLEM - Varnishkafka Delivery Errors on cp3015 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 461.891663 [18:21:00] Esams. upload+bits [18:21:06] yeah [18:21:12] That's a usual pattern. [18:21:59] That's why we kept analytics1021 out of the leaders. [18:22:27] Can you look on one of the machines to see if they can connect to the brokers? [18:22:59] RECOVERY - Varnishkafka Delivery Errors on cp3007 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:23:02] Also, you set up some connection monitoring between some esams caches and the brokers. [18:23:09] Do they show connection issues? [18:23:18] Or lost packets between the hosts? [18:23:19] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 515.18335 [18:23:50] RECOVERY - Varnishkafka Delivery Errors on cp3004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:23:50] PROBLEM - Varnishkafka Delivery Errors on cp3016 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 172.233337 [18:23:52] hm, some connection monitoring? [18:24:01] RECOVERY - Varnishkafka Delivery Errors on cp3015 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:24:32] During the analytics-ops sync-up meeting, where we changed the timeout on cp3022, you said that you also set up some connection tests. [18:24:40] Are they no longer running? [18:25:29] PROBLEM - Varnishkafka Delivery Errors on cp3010 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 377.299988 [18:26:19] um, hm, i do not remember saying such a thing. did I describe them? [18:26:30] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:26:59] RECOVERY - Varnishkafka Delivery Errors on cp3016 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:27:03] We talked about that the firewall is not allowing me to set it up, as it swallows pings. [18:27:14] And you said that you can do them from the caches to the brokers. [18:28:14] Meh. Regardless. [18:28:23] hm, i kinda remember that, but i don't thikn I ever set it up [18:28:24] Do the logs say anything meaningfull on the caches? 
[18:28:32] RECOVERY - Varnishkafka Delivery Errors on cp3010 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:28:43] just the usual, [18:28:44] Jan 2 18:25:56 cp3020 varnishkafka[11043]: PRODUCE: Failed to produce Kafka message (seq 5239829697): No buffer space available (500000 messages in outq) [18:28:44] Jan 2 18:26:24 cp3020 varnishkafka[11043]: PRODUCE: Suppressed 81783 (out of 81883) Kafka produce errors [18:28:44] Jan 2 18:26:33 cp3020 varnishkafka[11043]: KAFKADR: Kafka message delivery error: Local: Message timed out [18:29:40] k [18:31:56] Looks like things recovered. [18:33:04] ottomata: When looking at [18:33:04] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&hreg[]=%28amssq|cp%29.%2B&mreg[]=kafka.varnishkafka\.txerr.per_second>ype=line&title=kafka.varnishkafka\.txerr.per_second&aggregate=1 [18:33:30] (especially the Max reading) [18:33:33] The ones with high Max match the alerts. [18:33:39] Except for cp3022. [18:33:57] There wasn't an alert for cp3022 although it saw errors. [18:34:25] I am curious about the sequence stats for that hour. [18:35:01] If cp3022 also did better than the other hosts during that hour then I think the lowered timeout helped. [18:35:10] And we probably should lower the timeout in general [18:36:42] whoa check this out: http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&hreg[]=%28amssq%7Ccp%29.%2B&mreg[]=kafka.rdkafka.brokers..%2B%5C.rtt%5C.avg>ype=line&title=kafka.rdkafka.brokers..%2B%5C.rtt%5C.avg&aggregate=1 [18:36:58] i guess that is since an21 being offline? no. that was only a few days ago [18:37:02] 5 or 6 days ago, right? [18:40:13] ottomata: I think that increase about the time people started to really use the cluster :-) [18:42:51] ? nawww [18:42:53] hah [18:43:03] maybe. but that is varnishkafka rtt! not hadoop stuff [18:43:50] Sure. [18:44:00] People started hammering the cluster. [18:44:20] That made camus take longer and some such. [18:44:33] Cause more/different load on kafka [18:46:15] h [18:46:16] hm [18:46:21] Meh. I guess you're right. [18:46:31] It's too far fetched at this point. [18:49:09] qchris, yeah, i'm looking disk activity on brokers, and I don't see any signifigant change [18:49:51] HM [18:50:02] but I do right now. [18:50:15] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg[]=analytics1012.eqiad.wmnet%7Canalytics1018.eqiad.wmnet%7Canalytics1021.eqiad.wmnet%7Canalytics1022.eqiad.wmnet&mreg[]=diskstat_%28sda%7Csdb%7Csdc%7Csdd%7Csde%7Csdf%7Csdg%7Csdh%7Csdi%7Csdj%7Csdk%7Csdl%29_read_bytes_per_sec>ype=stack&title=diskstat_%28sda%7Csdb%7Csdc%7Csdd%7Csde%7Csdf%7Csdg%7Csdh%7Csdi%7Csdj%7Csdk%7Csdl%29_read_bytes_per_sec&aggregate=1 [18:50:25] i was running a crazy job [18:50:33] to convert a days worth of text data into parquet [18:50:38] and i was doing some other experiments on the sidd [18:50:48] with an hour of mobile data [18:51:25] And the spikes earlier today is me computing sequence stats. [18:51:43] both just died with a Error: Java heap space [18:52:01] :-) [18:52:04] hm, so, it sounds like hdfs activity is probably limiting camus which is causing kafka to slow down. [18:52:05] hmMMM [18:52:16] hadoop* [18:52:18] activity* [18:52:19] hm [18:52:24] ok, welp, we should probably: [18:52:28] 1. lower the vk timeout [18:52:38] 2. see if we can get camus to run really high priority [18:53:21] Full ACK. [18:56:23] should we do them at the same time? [18:56:32] maybe we should do just the timeout first and see if it happens again? 
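For reference on "lower the vk timeout" (the change proposed next as Gerrit 182469): request.timeout.ms is the librdkafka topic-level property that bounds how long the producer waits for a broker to acknowledge a produce request, and varnishkafka forwards such properties from its config file; the idea discussed above is that a 2-second cap makes varnishkafka fail and retry sooner instead of letting requests pile up until they surface as the "Message timed out" delivery errors. A sketch of the kind of setting involved follows; the config key and file path are illustrative, and the real change edits the puppet-managed template, so this is not the literal diff.

    # Illustrative varnishkafka config snippet (not the puppet template).
    # librdkafka topic properties are forwarded with a "kafka.topic." prefix:
    # fail a produce request if the broker has not acknowledged it within
    # 2000 ms, so the producer gives up on a slow broker sooner.
    kafka.topic.request.timeout.ms = 2000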
[18:59:22] (03PS1) 10Ottomata: Lower request topic_request_timeout_ms on all varnishkafkas to 2000 (2 seconds) [puppet] - 10https://gerrit.wikimedia.org/r/182469 [18:59:28] qchris: ^ [18:59:50] * qchris looks [19:00:28] ottomata: It's Friday evening. Will the deployment gods like such a change? [19:00:29] (03PS2) 10Ottomata: Lower request topic_request_timeout_ms on all varnishkafkas to 2000 (2 seconds) [puppet] - 10https://gerrit.wikimedia.org/r/182469 [19:00:49] naw, probably not. [19:00:49] :) [19:01:19] (03CR) 10QChris: [C: 031] Lower request topic_request_timeout_ms on all varnishkafkas to 2000 (2 seconds) [puppet] - 10https://gerrit.wikimedia.org/r/182469 (owner: 10Ottomata) [19:02:14] I'd be around tomorrow, but I cannot merge stuff in operations/puppet. [19:02:28] I think it's fine to postpone until Monday. [19:02:32] +1 [19:03:06] Now that we saw the effect, I'll bring analytics1021 back in after grabbing something to eat. [19:05:21] ja let's wait til monday [19:05:27] ok [19:07:41] (03PS20) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [19:09:09] (03PS21) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [19:09:13] valhallasw`cloud: I think I’ve addressed everything you’ve pointed out. [19:09:29] man, it’s been a while since I’ve written a large enough python program :| [19:12:03] let me add properish logging [19:38:09] (03PS22) 10Yuvipanda: Rewrite to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [19:38:19] valhallasw`cloud: ^ :D Also, would particularly like review on the commit message itself. [19:39:35] UnsolicitedPanda: does not describe ;p [19:39:46] hmm? [19:41:02] I wrote a fair bit on why [19:42:30] updated [19:42:31] (03PS23) 10Yuvipanda: Rewrite to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [19:44:24] "Rewrite to not be a big ball of mud" ;p [19:44:53] I'll go through it one more time [19:45:09] valhallasw`cloud: that’s an accurate summary line, I think :D [19:45:15] it’s a very unatomic rewrite. [19:45:20] very little code survived [19:45:58] or atomic rewrite ;D [19:55:03] (03CR) 10Merlijn van Deen: [C: 04-1] "The hard-coded 1049 is the only reason to -1, otherwise pretty much +1 with things-that-could-be-different-but-are-definitely-OK-already" (039 comments) [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 (owner: 10Yuvipanda) [20:06:18] (03CR) 10Yuvipanda: Rewrite to not be a big ball of mud (032 comments) [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 (owner: 10Yuvipanda) [20:06:35] (03PS24) 10Yuvipanda: Rewrite to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [20:06:41] valhallasw`cloud: tada [20:07:52] yesmuchbetter [20:08:11] (03CR) 10Merlijn van Deen: [C: 031] "Suddenly, a +1 hugs you!" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 (owner: 10Yuvipanda) [20:10:14] * UnsolicitedPanda gives hugs to valhallasw`cloud too [20:11:27] valhallasw`cloud: grr [20:11:30] valhallasw`cloud: that doesn’t actually work [20:11:37] AttributeError: 'module' object has no attribute 'ER' [20:11:43] it uses stupid import magic [20:11:46] so I need to import it [20:11:48] to be able to use it [20:12:08] already it’s stupidly named [20:12:08] ER [20:12:11] what the fuck is ER? [20:12:14] Emergency Room? 
[20:13:00] valhallasw`cloud: idunno, I like the plain number + doc better than this. [20:15:54] (03PS25) 10Yuvipanda: Rewrite to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [20:25:06] (03PS4) 10Yuvipanda: Fix issues with whitelisted & greylisted yaml files [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182102 [20:25:08] (03PS26) 10Yuvipanda: Rewrite to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [20:25:10] (03PS1) 10Yuvipanda: Add requirements.txt [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182474 [20:25:24] (03CR) 10Yuvipanda: [C: 032] Add requirements.txt [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182474 (owner: 10Yuvipanda) [20:25:34] (03CR) 10Yuvipanda: [V: 032] Add requirements.txt [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182474 (owner: 10Yuvipanda) [20:25:47] (03CR) 10Yuvipanda: [C: 032 V: 032] Fix issues with whitelisted & greylisted yaml files [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182102 (owner: 10Yuvipanda) [20:26:04] (03CR) 10Yuvipanda: [C: 032 V: 032] Rewrite to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 (owner: 10Yuvipanda) [20:26:07] yay :) [20:31:09] UnsolicitedPanda: ERrors [20:31:20] why not ERR [20:31:22] like, everything [20:31:24] or [20:31:26] ERRORS [20:31:28] or ERROR [20:31:34] anything other than ER [20:31:50] Mysql probably does it that way :-P [20:32:10] Anyway, from mysqldb.constants.er import error [20:40:00] PROBLEM - Varnishkafka Delivery Errors on cp3017 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 533.533325 [20:46:10] RECOVERY - Varnishkafka Delivery Errors on cp3017 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.091667 [21:04:49] PROBLEM - puppet last run on ms1004 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [21:07:50] PROBLEM - Varnishkafka Delivery Errors on cp3017 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 85.708336 [21:08:59] PROBLEM - Varnishkafka Delivery Errors on cp3016 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 676.133362 [21:12:01] RECOVERY - Varnishkafka Delivery Errors on cp3016 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:12:39] PROBLEM - Varnishkafka Delivery Errors on cp3006 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 267.375 [21:13:09] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1250.616699 [21:13:23] ack, phooey qchris ^! [21:13:26] cp3022 :/ [21:13:29] i gotta run, tty monday [21:13:42] :-/ [21:14:00] PROBLEM - Varnishkafka Delivery Errors on cp3017 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 149.991669 [21:14:10] Ok. See you on Monday ottomata. 
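For reference on the ER exchange above (20:11 to 20:32): MySQLdb does ship symbolic server error codes in MySQLdb.constants.ER, but that submodule is not pulled in by "import MySQLdb" alone, which is where the quoted AttributeError comes from; it has to be imported explicitly. The hard-coded 1049 flagged in the earlier review corresponds to ER.BAD_DB_ERROR ("unknown database"). A minimal sketch, with placeholder connection parameters:

    # MySQLdb's error-code constants live in a submodule that must be
    # imported explicitly; MySQLdb.constants.ER is not available via
    # "import MySQLdb" alone, hence the AttributeError quoted above.
    import os
    import MySQLdb
    from MySQLdb.constants import ER

    try:
        conn = MySQLdb.connect(
            host="dbhost.example.org",                      # placeholder
            db="nosuchdb",                                  # placeholder
            read_default_file=os.path.expanduser("~/.my.cnf"))
    except MySQLdb.OperationalError as e:
        if e.args[0] == ER.BAD_DB_ERROR:    # 1049, "Unknown database"
            pass                            # e.g. skip databases missing on this host
        else:
            raise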
[21:14:50] PROBLEM - Varnishkafka Delivery Errors on cp3003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 550.56665 [21:15:40] RECOVERY - Varnishkafka Delivery Errors on cp3006 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:17:30] PROBLEM - Varnishkafka Delivery Errors on cp3007 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 130.850006 [21:18:00] RECOVERY - Varnishkafka Delivery Errors on cp3003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:18:00] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 232.0 [21:18:19] PROBLEM - Varnishkafka Delivery Errors on cp3004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 283.049988 [21:19:29] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:19:53] !log Ran kafka leader re-election to bring analytics1021 back into the set of leaders [21:19:57] Logged the message, Master [21:20:19] RECOVERY - Varnishkafka Delivery Errors on cp3017 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:21:20] RECOVERY - Varnishkafka Delivery Errors on cp3004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:23:00] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6224.66870768 [21:24:10] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:26:50] RECOVERY - Varnishkafka Delivery Errors on cp3007 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:27:19] PROBLEM - Varnishkafka Delivery Errors on cp3003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1844.18335 [21:36:31] RECOVERY - Varnishkafka Delivery Errors on cp3003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:37:47] well at least they mostly seem recovered [21:39:50] I think they all recovered (for now). [21:57:10] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [22:15:09] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:22:50] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 618876 msg (=400000 warning): ocg_render_job_queue 5015 msg (=3000 critical) [22:23:19] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 622061 msg (=400000 warning): ocg_render_job_queue 7862 msg (=3000 critical) [22:23:20] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 622400 msg (=400000 warning): ocg_render_job_queue 8126 msg (=3000 critical) [22:35:38] cscott_away: ping?
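For reference on the "kafka leader re-election" log entry above: with the Kafka 0.8 tooling of the time, moving partition leadership back onto a recovered broker such as analytics1021 is normally done with the preferred-replica election tool, which asks the controller to hand each partition back to its first-listed (preferred) replica; run without a partition list it does this for every topic-partition. A sketch of invoking the stock tool follows; the install path and ZooKeeper connect string are placeholders, and WMF's own wrapper may differ.

    # Trigger a preferred-replica leader election with the stock Kafka 0.8
    # tool. Path and ZooKeeper connect string are placeholders.
    import subprocess

    subprocess.check_call([
        "/usr/share/kafka/bin/kafka-preferred-replica-election.sh",
        "--zookeeper", "zk.example.eqiad.wmnet:2181/kafka",
    ])
    # Without --path-to-json-file the election covers all topic-partitions,
    # which is what bringing a broker "back into the set of leaders" amounts
    # to once it has caught up and is the preferred replica again.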