[00:07:19] <^d> hoo: Anyway, it's not done but here's enough to give you: http://p.defau.lt/?iO7VNWKrhZmOJwAVPgi9UA
[00:10:17] !log LocalisationUpdate completed (1.23wmf12) at 2014-01-31 00:10:17+00:00
[00:10:25] Logged the message, Master
[00:19:28] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-01-31 00:19:28+00:00
[00:19:35] Logged the message, Master
[00:19:39] !log xtrabackup clone db1042 to db1020
[00:19:46] Logged the message, Master
[00:24:26] (PS4) TTO: Give testwiki some custom namespaces [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78016
[00:25:47] (PS5) TTO: Give testwiki some custom namespaces [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78016
[00:31:23] yurikR: you have what?
[00:31:46] (sorry, was in training, just finished, have a call with robla in, well, -1 minutes)
[00:32:40] (PS1) Andrew Bogott: Move a few more things from virt1005 to labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/110476
[01:03:31] (PS1) BryanDavis: logstash: Improve filters [operations/puppet] - https://gerrit.wikimedia.org/r/110483
[01:10:23] (CR) BryanDavis: logstash: Improve filters (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/110483 (owner: BryanDavis)
[01:23:39] (PS1) Springle: depool db1050 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110487
[01:24:06] (CR) Springle: [C: 2] depool db1050 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110487 (owner: Springle)
[01:24:59] (Merged) jenkins-bot: depool db1050 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110487 (owner: Springle)
[01:25:34] !log springle synchronized wmf-config/db-eqiad.php 'depool db1050 for schema changes'
[01:25:45] Logged the message, Master
[01:28:09] Reedy: What was up with Jenkins yesterday?
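For readers unfamiliar with the depool/repool pattern logged above: wmf-config/db-eqiad.php keeps per-section maps of database server to load weight, and taking a replica out for schema changes amounts to removing it from (or zero-weighting it in) that map before syncing the config out. A minimal Python sketch of the idea; the weights, host set, and function names here are invented for illustration and are not the actual WMF configuration:

```python
# Hypothetical sketch of the "depool db1050 for schema changes" /
# "repool db1020, warm up" pattern seen in the log above. Production keeps
# these maps in PHP (wmf-config/db-eqiad.php); this only models the idea.

def depool(section_loads, host):
    """Return a copy of the weight map with `host` removed, so the
    load balancer stops sending it queries."""
    return {h: w for h, w in section_loads.items() if h != host}

def repool(section_loads, host, weight=100):
    """Add the host back, typically at reduced weight first so its
    caches warm up before taking full traffic."""
    pooled = dict(section_loads)
    pooled[host] = weight
    return pooled

s1 = {"db1050": 200, "db1052": 200, "db1020": 100}  # invented weights
s1 = depool(s1, "db1050")           # schema changes on db1050
assert "db1050" not in s1
s1 = repool(s1, "db1020", weight=50)  # warm up at half weight
assert s1["db1020"] == 50
```

After editing the real config, the change is merged in Gerrit and pushed with the sync script, which is what the `!log springle synchronized wmf-config/db-eqiad.php` entries record.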
[01:37:39] anyone around who could sync a Wikidata update?
[01:37:58] +1
[02:01:50] !log LocalisationUpdate completed (1.23wmf11) at 2014-01-31 02:01:50+00:00
[02:02:02] Logged the message, Master
[02:02:29] (PS1) Andrew Bogott: Don't hotpatch the virtdriver for havana. [operations/puppet] - https://gerrit.wikimedia.org/r/110492
[02:02:37] !log LocalisationUpdate completed (1.23wmf12) at 2014-01-31 02:02:37+00:00
[02:02:44] Logged the message, Master
[02:07:39] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-01-31 02:07:39+00:00
[02:07:52] Logged the message, Master
[02:13:12] (PS2) Andrew Bogott: Move a few more things from virt1005 to labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/110476
[02:13:14] (PS2) Andrew Bogott: Don't hotpatch the virtdriver for havana. [operations/puppet] - https://gerrit.wikimedia.org/r/110492
[02:16:18] (CR) Andrew Bogott: [C: 2] Move a few more things from virt1005 to labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/110476 (owner: Andrew Bogott)
[02:16:29] (CR) Andrew Bogott: [C: 2] Don't hotpatch the virtdriver for havana. [operations/puppet] - https://gerrit.wikimedia.org/r/110492 (owner: Andrew Bogott)
[02:51:17] (PS1) Andrew Bogott: Add nova-conductor to havana installs. [operations/puppet] - https://gerrit.wikimedia.org/r/110502
[02:55:46] TimStarling: Got a few minutes to quickly deploy a Wikidata update? Only need to +2 https://gerrit.wikimedia.org/r/110488 and sync it
[02:56:22] sounds scary
[02:56:38] TimStarling: :P Only includes two minor fixes
[02:59:43] ori: --^ Tested on beta...
[03:06:03] (PS2) Andrew Bogott: Add nova-conductor to havana installs. [operations/puppet] - https://gerrit.wikimedia.org/r/110502
[03:08:05] (CR) Andrew Bogott: [C: 2] Add nova-conductor to havana installs. [operations/puppet] - https://gerrit.wikimedia.org/r/110502 (owner: Andrew Bogott)
[03:12:59] PROBLEM - DPKG on virt1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[03:14:58] my laptop just turned itself off for some reason
[03:14:59] RECOVERY - DPKG on virt1003 is OK: All packages OK
[03:16:10] Jan 31 14:08:03 shimmer kernel: [18504.648389] thermal_sys: Critical temperature reached (102 C), shutting down
[03:16:27] yikes
[03:16:58] I was compiling hiphop
[03:17:59] PROBLEM - DPKG on virt1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[03:18:03] Sounds fun :P
[03:18:56] so I was going to do your update, and I still can do it
[03:18:59] RECOVERY - DPKG on virt1003 is OK: All packages OK
[03:19:07] I just want to be fairly sure the laptop isn't going to turn itself off again first
[03:20:17] Makes sense... I'll go to bed now, but aude will still be around
[03:20:21] if you can sync, that would be great
[03:21:10] TimStarling: you still maintain the Special pages on enwp?
[03:21:39] I'm particularly interested in Special:UncategorizedPages atm
[03:22:09] Trying to track down a bug where pages only in hidden cats show up..
[03:22:19] PROBLEM - DPKG on virt1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[03:23:19] RECOVERY - DPKG on virt1002 is OK: All packages OK
[03:29:31] i may need some throttle.php deploys tomorrow or saturday. (IP info may trickle in). anyone planning to be here saturday by chance?
[03:29:39] events spanning 4+ cities
[03:29:59] (feel free to blame Pharos for the late warning)
[03:30:38] * jeremyb is telling him to file bugs now...
[03:31:51] http://paste.tstarling.com/p/bhUBQv.html
[03:31:55] looks pretty badly broken to me
[03:32:00] :/
[03:32:08] wtf
[03:32:22] i was about to say wikidata and look who's here!
[03:32:24] :-)
[03:33:26] yeah, I'm doing an update for aude and hoo
[03:33:40] I can probably revert it for now
[03:37:14] !log tstarling started scap: mostly no-op, did a reset to revert the wikidata changes
[03:37:22] Logged the message, Master
[03:38:49] we'll investigate and have a fix ready tomorrow
[03:39:55] i can't reproduce the issue on beta
[03:40:01] but things aren't exactly the same
[03:40:56] !log tstarling finished scap: mostly no-op, did a reset to revert the wikidata changes (duration: 05m 00s)
[03:41:04] Logged the message, Master
[03:50:49] RECOVERY - Host labnet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[03:51:38] Soo... I just had a thought (you can trout me now or wait til I make a fool of myself)...
[03:52:59] PROBLEM - DPKG on labnet1001 is CRITICAL: Connection refused by host
[03:53:06] ^ me again
[03:53:10] What's the plan for all of the old toolserver equipment that's being freed up as stuff moves to labs? Could some of that equipment be dedicated to maintaining the Special page reports on the projects, like daily?
[03:53:19] PROBLEM - RAID on labnet1001 is CRITICAL: Connection refused by host
[03:53:19] PROBLEM - puppet disabled on labnet1001 is CRITICAL: Connection refused by host
[03:53:29] PROBLEM - SSH on labnet1001 is CRITICAL: Connection refused
[03:53:39] PROBLEM - Disk space on labnet1001 is CRITICAL: Connection refused by host
[03:53:50] T13|sleeps: I don't know where that hardware is bound… but I think most of it is Solaris, so we won't especially want it in a WMF datacenter.
[03:53:59] there is already equipment in tampa that could be used for that, and I have said that it should be
[03:54:37] Is there a ticket on that I can cc?
[03:57:52] andrewbogott: The Sun boxes can run Debian as well, I believe. But for a handful of machines that are x years old it's probably best to donate them to another project (or sell them if they're worth the trouble).
[03:58:38] T13|sleeps: well, there is bug 15434
[03:59:40] on https://gerrit.wikimedia.org/r/#/c/33713/ I said they should be run more often if we have the servers for it
[04:00:21] !b 15434
[04:00:21] https://bugzilla.wikimedia.org/15434
[04:01:11] the version that was eventually merged was to run them monthly
[04:01:57] Hi YuviPanda
[04:03:47] TimStarling: Got a few minutes to quickly deploy a Wikidata update? Only need to +2 https://gerrit.wikimedia.org/r/110488 and sync it
[04:03:53] aude: obviously that comment was a jinx
[04:04:16] I think every time someone asks me to deploy a really easy change because it's really really easy, it turns out to be fantastically complicated
[04:05:09] PROBLEM - NTP on labnet1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[04:09:35] now, I think it might be about time to clean the inside of my laptop, bbl
[04:16:29] RECOVERY - SSH on labnet1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[04:31:39] RECOVERY - Disk space on labnet1001 is OK: DISK OK
[04:31:59] RECOVERY - DPKG on labnet1001 is OK: All packages OK
[04:32:19] RECOVERY - RAID on labnet1001 is OK: NRPE: Unable to read output
[04:32:19] RECOVERY - puppet disabled on labnet1001 is OK: OK
[04:43:35] (PS1) Andrew Bogott: Revert "fixing typo - taged to tagged" [operations/puppet] - https://gerrit.wikimedia.org/r/110509
[04:43:37] (PS1) Andrew Bogott: Revert "Creating network config for eqiad nova network controller" [operations/puppet] - https://gerrit.wikimedia.org/r/110510
[04:44:06] LeslieCarr: ^ agreed?
[04:47:48] (CR) Andrew Bogott: [C: 2] Revert "fixing typo - taged to tagged" [operations/puppet] - https://gerrit.wikimedia.org/r/110509 (owner: Andrew Bogott)
[04:47:58] (CR) Andrew Bogott: [C: 2] Revert "Creating network config for eqiad nova network controller" [operations/puppet] - https://gerrit.wikimedia.org/r/110510 (owner: Andrew Bogott)
[04:49:56] (PS1) Springle: repool db1020, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110511
[04:49:59] RECOVERY - NTP on labnet1001 is OK: NTP OK: Offset -0.04608952999 secs
[04:52:16] (CR) Springle: [C: 2] repool db1020, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110511 (owner: Springle)
[04:52:23] (Merged) jenkins-bot: repool db1020, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110511 (owner: Springle)
[04:57:09] !log springle synchronized wmf-config/db-eqiad.php 'repool db1020, warm up'
[04:57:18] Logged the message, Master
[05:01:27] (PS1) Springle: depool db1052 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110512
[05:02:36] (CR) Springle: [C: 2] depool db1052 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110512 (owner: Springle)
[05:02:42] (Merged) jenkins-bot: depool db1052 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110512 (owner: Springle)
[05:03:43] !log springle synchronized wmf-config/db-eqiad.php 'depool db1052 for schema changes'
[05:03:52] Logged the message, Master
[05:13:10] well, not much dust came out, but now it plateaus at about 70°C instead of rising rapidly past 80, so I guess that is job done
[05:32:48] TimStarling: Picked it apart completely? On my box the dust accumulates between the ventilator and the cooling fins, on the "inner" side (windward? leeward?). So the outside of the cooling fins looks fine, yet the air flow is almost blocked.
[05:42:47] just took the keyboard off, sprayed the top of the motherboard and RAM with compressed gas, and then sprayed all the vents
[05:43:53] spraying the vents is probably what did it, some dust came back out
[05:47:59] by compressed gas I mean http://au.element14.com/jsp/search/productdetail.jsp?SKU=1895387
[05:48:01] handy stuff
[05:48:27] is that the kind of thing you do at the same time as changing fire alarm batteries? :)
[05:48:55] probably should :)
[05:49:18] (of course i really mean smoke alarm...)
[06:13:49] (PS1) Tim Landscheidt: Rename last references of labsconsole to wikitech [operations/puppet] - https://gerrit.wikimedia.org/r/110513
[06:24:17] (CR) Jeremyb: [C: -1] "This was after already having split it up some. (e.g. jobs/chapters)" [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (owner: Jeremyb)
[06:26:29] (CR) Jeremyb: "faidon says:" [operations/apache-config] - https://gerrit.wikimedia.org/r/106110 (owner: Jeremyb)
[06:30:00] (CR) Jeremyb: "err, s/no fixed/now fixed/" [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (owner: Jeremyb)
[06:33:52] (CR) Jeremyb: "also, some of these changes are prerequisites for moving redirects into redirects.dat. commit msg says "removed some redirect hops". oth" [operations/apache-config] - https://gerrit.wikimedia.org/r/106109 (owner: Jeremyb)
[06:50:00] (PS1) Springle: db1020 full steam. depool db1011 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110514
[06:50:33] (CR) Springle: [C: 2] db1020 full steam. depool db1011 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110514 (owner: Springle)
[06:50:39] (Merged) jenkins-bot: db1020 full steam. depool db1011 for schema changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110514 (owner: Springle)
[06:51:42] !log springle synchronized wmf-config/db-eqiad.php 'db1020 full steam. depool db1011 for schema changes'
[06:51:50] Logged the message, Master
[06:52:44] (CR) Andrew Bogott: [C: 2] "This is great, thank you!" [operations/puppet] - https://gerrit.wikimedia.org/r/110513 (owner: Tim Landscheidt)
[06:58:00] (PS1) Springle: repool db1052, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110515
[06:58:21] (CR) Springle: [C: 2] repool db1052, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110515 (owner: Springle)
[06:58:26] (Merged) jenkins-bot: repool db1052, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110515 (owner: Springle)
[06:59:31] !log springle synchronized wmf-config/db-eqiad.php 'repool db1052, warm up'
[06:59:39] Logged the message, Master
[08:16:52] lo
[09:10:27] (PS1) Hashar: beta: fatal email should say 'last twelve hours' [operations/puppet] - https://gerrit.wikimedia.org/r/110519
[09:11:25] (PS1) Andrew Bogott: Replace and fix the havana libvirt driver hotpatch. [operations/puppet] - https://gerrit.wikimedia.org/r/110520
[09:13:58] (CR) Andrew Bogott: [C: 2] Replace and fix the havana libvirt driver hotpatch. [operations/puppet] - https://gerrit.wikimedia.org/r/110520 (owner: Andrew Bogott)
[09:19:03] (CR) Hashar: "Some packages are already installed by other components, that would cause puppet to complain about duplicate packages :-(" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/90684 (owner: Diederik)
[09:56:25] (PS1) Nemo bis: Split exim stats to own class and add it to mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/110524
[09:56:37] matanya and akosiaris_away, patch for you :)
[09:57:03] (CR) jenkins-bot: [V: -1] Split exim stats to own class and add it to mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/110524 (owner: Nemo bis)
[09:58:13] (PS2) Nemo bis: Split exim stats to own class and add it to mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/110524
[09:58:49] (CR) jenkins-bot: [V: -1] Split exim stats to own class and add it to mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/110524 (owner: Nemo bis)
[09:59:40] grr
[10:12:37] Reedy: Around?
[10:19:45] (Abandoned) Odder: Close wikimania2013 wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/104726 (owner: Odder)
[10:30:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[10:54:04] is anyone looking into the fatals and exceptions we've had in prod for the past 24h?
[10:58:09] ori: I am not
[10:58:21] we got a new wmf version deployed yesterday I think
[10:58:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[10:59:30] * hashar loves logstash
[11:00:05] Invalid argument supplied for foreach() in /usr/local/apache/common-local/php-1.23wmf11/includes/filebackend/SwiftFileBackend.php on line 648
[11:00:05] Invalid argument supplied for foreach() in /usr/local/apache/common-local/php-1.23wmf11/includes/filebackend/SwiftFileBackend.php on line 180
[11:00:06] :(
[11:00:18] mmm, i still prefer grepping files, but i want to like logstash better
[11:00:23] waiting for the setup to improve
[11:00:39] 220 and 218 occurrences over the last 12 hours
[11:00:50] and 3606 occurrences of Recursion detected in RequestContext::getLanguage in /usr/local/apache/common-local/php-1.23wmf11/includes/context/RequestContext.php on line 318
[11:01:35] the Recursion one is known; it has to do with the initialization of the user object. There's a (very) lengthy discussion in Bugzilla IIRC and bd808|BUFFER intends to fix it
[11:02:24] i am going to bed tho, passing out. bye! thanks for checking hashar
[11:02:38] ori: have sweet dreams!
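The triage hashar is doing above (grep the fatal log and count how often each distinct error signature fired over the last N hours) can be sketched like this. The regex and sample lines are illustrative; this is not the actual fluorine/logstash tooling:

```python
import re
from collections import Counter

# Count occurrences of each PHP error "signature" (message + file + line),
# the way one might grep a fatal log by hand. Sample lines are invented,
# modeled on the errors quoted in the channel above.
ERROR_RE = re.compile(r"(?P<msg>.+?) in (?P<file>\S+) on line (?P<line>\d+)")

def count_signatures(lines):
    counts = Counter()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            counts[(m.group("msg"), m.group("file"), m.group("line"))] += 1
    return counts

sample = [
    "Invalid argument supplied for foreach() in SwiftFileBackend.php on line 648",
    "Invalid argument supplied for foreach() in SwiftFileBackend.php on line 648",
    "Recursion detected in RequestContext.php on line 318",
]
for (msg, path, line), n in count_signatures(sample).most_common():
    print(f"{n} x {msg} ({path}:{line})")
```

Grouping by (message, file, line) rather than the raw line is what makes the "220 and 218 occurrences" style of summary possible even when each hit carries a different timestamp.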
[11:03:29] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[11:08:29] PROBLEM - Varnishkafka Delivery Errors on cp1059 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 534.366638
[11:08:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3041.699951
[11:08:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2969.533447
[11:08:49] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 819.533325
[11:08:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 824.866638
[11:09:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3075.199951
[11:09:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 551.06665
[11:09:19] PROBLEM - Varnishkafka Delivery Errors on cp4002 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.625
[11:09:19] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 544.0
[11:09:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 550.466675
[11:09:29] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.466667
[11:09:29] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.666667
[11:09:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3054.93335
[11:09:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.764706
[11:09:29] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.675676
[11:09:39] PROBLEM - Varnishkafka Delivery Errors on cp4003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.658537
[11:09:49] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.65
[11:10:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:10:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:10:49] PROBLEM - Varnishkafka Delivery Errors on cp3014 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.016667
[11:11:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:11:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:13:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4991.133301
[11:13:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3023.56665
[11:13:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3029.5
[11:13:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5024.866699
[11:13:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5050.066895
[11:14:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3085.466553
[11:14:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5002.266602
[11:14:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3032.800049
[11:15:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:15:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:15:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:15:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:16:26] what happened with the broker?
[11:18:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4991.700195
[11:18:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5020.133301
[11:18:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4992.0
[11:18:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[11:19:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4939.366699
[11:20:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:20:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:21:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:21:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:22:49] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:19] RECOVERY - Varnishkafka Delivery Errors on cp4002 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:19] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:29] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:29] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:29] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:29] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:39] RECOVERY - Varnishkafka Delivery Errors on cp4003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:49] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:49] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:23:49] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:24:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:24:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:24:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[11:27:29] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[11:29:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[11:29:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 853.533325
[11:30:19] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 559.266663
[11:30:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 560.099976
[11:30:29] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.65
[11:30:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.625
[11:30:29] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[11:30:29] PROBLEM - Varnishkafka Delivery Errors on cp1059 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 571.333313
[11:30:39] PROBLEM - Varnishkafka Delivery Errors on cp4003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.575758
[11:30:49] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 847.333313
[11:30:49] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.7
[11:31:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 575.099976
[11:31:19] PROBLEM - Varnishkafka Delivery Errors on cp4002 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.625
[11:31:29] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.4
[11:31:29] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.542857
[11:31:49] PROBLEM - Varnishkafka Delivery Errors on cp3014 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.016667
[11:35:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4991.100098
[11:35:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3179.133301
[11:35:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3235.366699
[11:35:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3226.666748
[11:35:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5077.433105
[11:35:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5082.600098
[11:36:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3302.56665
[11:36:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5044.700195
[11:37:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:37:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:37:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:37:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:40:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4972.633301
[11:40:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5020.233398
[11:40:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5049.033203
[11:41:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5039.0
[11:42:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:42:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:42:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:42:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:45:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4989.033203
[11:45:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5019.0
[11:45:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4982.100098
[11:46:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4986.766602
[11:47:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:47:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:47:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:47:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:49:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[11:50:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4984.299805
[11:50:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5007.433105
[11:50:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4032.300049
[11:50:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5047.5
[11:51:29] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:51:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:51:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:51:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:51:49] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:51:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:51:49] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:19] RECOVERY - Varnishkafka Delivery Errors on cp4002 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:29] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:29] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:29] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:39] RECOVERY - Varnishkafka Delivery Errors on cp4003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:49] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:52:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:53:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:53:19] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:53:49] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:54:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:55:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:55:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[11:56:29] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[11:59:29] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[11:59:49] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:01:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 8.0
[12:02:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[12:02:59] I rebooted virt1001 just to mix up the icinga spam
[12:03:49] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[12:03:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[12:04:23] :)
[12:08:39] PROBLEM - Varnishkafka Delivery Errors on cp4003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.75
[12:08:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 877.599976
[12:08:49] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 884.799988
[12:09:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 629.666687
[12:09:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3508.899902
[12:09:19] PROBLEM - Varnishkafka Delivery Errors on cp4002 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.666667
[12:09:21] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 620.200012
[12:09:29] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.648649
[12:09:29] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.647059
[12:09:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 635.900024
[12:09:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3457.0
[12:09:29]
PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.633333 [12:09:29] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:29] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.633333 [12:09:30] PROBLEM - Varnishkafka Delivery Errors on cp1059 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 641.766663 [12:09:31] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3485.56665 [12:09:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3454.800049 [12:09:49] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.65 [12:10:49] PROBLEM - Varnishkafka Delivery Errors on cp3014 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.016667 [12:13:15] (03PS3) 10Nemo bis: Split exim stats to own class and add it to mchenry [operations/puppet] - 10https://gerrit.wikimedia.org/r/110524 [12:13:39] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [12:13:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5109.299805 [12:13:52] (03CR) 10jenkins-bot: [V: 04-1] Split exim stats to own class and add it to mchenry [operations/puppet] - 10https://gerrit.wikimedia.org/r/110524 (owner: 10Nemo bis) [12:14:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4967.266602 [12:14:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4987.399902 [12:14:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5004.799805 [12:15:39] (03PS4) 10Nemo bis: 
Split exim stats to own class and add it to mchenry [operations/puppet] - 10https://gerrit.wikimedia.org/r/110524 [12:15:49] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:15:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:15:51] now why did local puppet parser validate --trace mail.pp not catch that, hmpf [12:16:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:16:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:16:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:18:49] PROBLEM - Varnishkafka Delivery Errors on cp3014 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.016667 [12:18:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [12:19:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5054.633301 [12:19:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5016.133301 [12:19:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5053.766602 [12:19:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5062.166504 [12:20:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:21:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:21:29] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 
0.0 [12:21:29] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:21:39] RECOVERY - Varnishkafka Delivery Errors on cp4003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:21:49] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:21:49] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:19] RECOVERY - Varnishkafka Delivery Errors on cp4002 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:19] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:29] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:29] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:49] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: 
kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:22:49] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:23:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:23:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [12:26:29] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [12:27:39] PROBLEM - NTP on virt1001 is CRITICAL: NTP CRITICAL: Offset unknown [12:28:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 890.366638 [12:29:19] PROBLEM - Varnishkafka Delivery Errors on cp4002 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6 [12:29:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 644.900024 [12:29:19] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 646.966675 [12:29:29] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.633333 [12:29:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 657.43335 [12:29:29] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.677419 [12:29:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.606061 [12:29:29] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [12:29:29] PROBLEM - Varnishkafka Delivery Errors on cp1059 
is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 665.299988 [12:29:29] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.558824 [12:29:39] PROBLEM - Varnishkafka Delivery Errors on cp4003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.636364 [12:29:49] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.684211 [12:29:49] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 885.333313 [12:30:49] PROBLEM - Varnishkafka Delivery Errors on cp3014 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.016667 [12:33:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5001.133301 [12:34:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3830.266602 [12:34:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5131.766602 [12:34:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3794.133301 [12:34:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5095.100098 [12:34:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3770.56665 [12:34:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3745.466553 [12:34:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5118.733398 [12:35:09] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: 
kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:35:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:35:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:36:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:38:39] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [12:38:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5206.566895 [12:39:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5100.633301 [12:39:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5047.200195 [12:39:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5082.533203 [12:40:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:40:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:40:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:41:09] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:41:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:41:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:41:49] PROBLEM - Host virt1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5174.700195 [12:44:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 
is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5094.533203 [12:44:29] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [12:44:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4947.333496 [12:44:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3732.533447 [12:44:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5051.399902 [12:45:09] RECOVERY - Host virt1003 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [12:45:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:45:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:45:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:45:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:45:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:48:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5146.633301 [12:49:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5115.066895 [12:49:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5137.899902 [12:49:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5085.399902 [12:49:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [12:50:29] RECOVERY - Varnishkafka 
Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:50:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.7 [12:50:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:50:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:51:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:51:49] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:19] RECOVERY - Varnishkafka Delivery Errors on cp4002 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:19] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:29] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:29] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:29] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:29] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:39] RECOVERY - Varnishkafka Delivery Errors on cp4003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: 
kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:49] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:49] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:52:49] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:53:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:53:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:53:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:53:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [12:57:29] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [12:59:19] PROBLEM - Varnishkafka Delivery Errors on cp4002 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6 [12:59:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 741.43335 [12:59:19] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 732.56665 [12:59:29] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.625 [12:59:29] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6 [12:59:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 742.56665 [12:59:29] PROBLEM - Varnishkafka Delivery Errors on cp1059 is CRITICAL: 
kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 729.0 [12:59:29] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6 [12:59:39] PROBLEM - Varnishkafka Delivery Errors on cp4003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6 [12:59:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 891.93335 [12:59:49] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.65 [12:59:49] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 899.266663 [13:00:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.7 [13:00:29] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [13:00:49] PROBLEM - Varnishkafka Delivery Errors on cp3014 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.016667 [13:04:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4260.733398 [13:04:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5140.266602 [13:04:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4239.866699 [13:04:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5112.366699 [13:04:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4237.933105 [13:04:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4214.233398 [13:04:49] PROBLEM - 
Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5121.633301 [13:04:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5115.666504 [13:05:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:05:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:05:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:06:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:06:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:06:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:06:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:08:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.65 [13:09:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5214.266602 [13:09:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3999.466553 [13:09:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:09:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5108.666504 [13:09:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4082.966553 [13:09:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: 
kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5147.166504 [13:09:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5135.200195 [13:11:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:11:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:11:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:11:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:11:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:11:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:11:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:14:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5157.933105 [13:14:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4034.600098 [13:14:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.65 [13:14:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5088.200195 [13:14:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4069.133301 [13:14:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4069.100098 [13:14:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5079.166504 
[13:14:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5074.399902
[13:15:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:16:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:16:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:16:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:16:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:18:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[13:19:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5077.733398
[13:19:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5025.600098
[13:19:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4131.200195
[13:19:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5035.966797
[13:19:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5060.266602
[13:21:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:21:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:21:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:21:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:21:39] RECOVERY - Varnishkafka Delivery Errors on cp4003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:21:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:21:49] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:21:49] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:21:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:19] RECOVERY - Varnishkafka Delivery Errors on cp4002 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:19] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:29] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:29] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:29] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:29] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:49] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:22:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[13:24:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5030.399902
[13:26:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:28:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 736.5
[13:28:29] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.633333
[13:28:29] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.625
[13:28:29] PROBLEM - Varnishkafka Delivery Errors on cp1059 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 735.06665
[13:28:39] PROBLEM - Varnishkafka Delivery Errors on cp4003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.605263
[13:28:49] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6
[13:28:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 911.333313
[13:29:19] PROBLEM - Varnishkafka Delivery Errors on cp4002 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.636364
[13:29:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 746.599976
[13:29:19] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 740.133362
[13:29:29] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:29:29] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6
[13:29:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5030.066895
[13:29:49] PROBLEM - Varnishkafka Delivery Errors on cp3014 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.016667
[13:30:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.65
[13:30:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:32:29] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.631579
[13:33:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4599.733398
[13:33:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5256.100098
[13:33:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4703.033203
[13:33:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5285.233398
[13:34:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4627.0
[13:34:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5188.200195
[13:34:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4569.799805
[13:34:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5075.299805
[13:35:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:35:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:35:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:35:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:35:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:35:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:36:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:37:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:38:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5289.266602
[13:38:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5323.733398
[13:38:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4530.466797
[13:38:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4526.700195
[13:38:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5319.755371
[13:39:20] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4582.766602
[13:39:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5083.366699
[13:39:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[13:40:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK:
kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:40:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:40:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.692308 [13:40:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:40:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:40:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:41:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:43:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5265.466797 [13:43:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4400.933105 [13:43:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:43:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5315.200195 [13:43:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4523.733398 [13:43:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5349.533203 [13:44:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4575.100098 [13:44:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:44:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5075.833496 [13:45:29] 
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:45:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:45:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:45:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:45:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:46:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:46:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.625 [13:47:36] andrewbogott: are you able to access labnet1001 now? [13:47:56] I was able to so I am assuming you figured it out...server was left off [13:48:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4439.333496 [13:48:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5364.833496 [13:48:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4521.233398 [13:48:30] cmjohnson1: I am! I wanted to reinstall the OS anyway. 
[13:48:43] cmjohnson1: I assume there must be software config bits left for the new nics [13:48:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5337.566895 [13:48:54] But, will probably lean on mark or LeslieCarr for that [13:49:11] there isn't a new nic right now...the spare i had is not the right type for that server [13:49:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4423.100098 [13:49:24] we have to order a new one...which i believe is waiting on approval from mark [13:49:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5150.5 [13:49:35] i approved that [13:49:45] okay...then we're good to go as soon as it gets here [13:49:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5065.200195 [13:50:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:50:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:50:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:50:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:50:31] cmjohnson1: Oh! Ok, so it's the same as before then? 
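The flood of icinga-wm alerts above is a threshold check on the kafka.varnishkafka.kafka_drerr.per_second metric flapping between CRITICAL and OK as delivery errors burst and clear. A minimal sketch of how such a per-second rate check could classify two counter samples (the threshold and interval here are illustrative, not the actual icinga configuration):

```python
# Hedged sketch of a Nagios-style rate check like the
# "Varnishkafka Delivery Errors" alerts: derive a per-second rate
# from two counter samples and classify it. Threshold is assumed.

def classify_drerr_rate(prev_count, curr_count, interval_s, crit_threshold=0.0):
    """Return (status, rate) for a delivery-error counter delta."""
    rate = max(curr_count - prev_count, 0) / interval_s
    status = "CRITICAL" if rate > crit_threshold else "OKAY"
    return status, round(rate, 6)

# Flapping as seen in the log: a burst of drops, then quiet.
print(classify_drerr_rate(0, 150912, 30))       # burst  -> ('CRITICAL', 5030.4)
print(classify_drerr_rate(150912, 150912, 30))  # quiet  -> ('OKAY', 0.0)
```

With a CRITICAL threshold of effectively zero, any non-zero drop rate between polls flips the check, which is why the channel sees rapid PROBLEM/RECOVERY pairs.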
[13:50:35] wooo weee [13:50:38] kafka is unhappy [13:50:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:50:42] hmm [13:50:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:51:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:51:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:52:16] andrewbogott: yes nothing has changed yet [13:52:22] ok [13:53:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4450.233398 [13:53:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:53:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5324.066895 [13:53:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4613.0 [13:53:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5327.200195 [13:54:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4572.133301 [13:54:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5124.133301 [13:54:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4390.566895 [13:54:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5073.0 [13:55:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:55:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 
is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:55:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:55:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:55:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:55:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:56:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:56:29] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [13:56:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.65 [13:56:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [13:58:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5306.833496 [13:58:29] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4354.200195 [13:58:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5308.966797 [13:58:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4489.0 [13:58:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5340.0 [13:59:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4573.233398 [13:59:29] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 
0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [13:59:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5079.899902 [13:59:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [14:00:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:00:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:00:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:00:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:00:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:00:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [14:01:10] i am working on this, but that is a really annoying state for this to be in [14:01:13] yes i know it is broken!
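The "Kafka Broker Server" alerts above are a check_procs-style process check: count processes whose command name and argument string match, and alarm when none are found. A hedged Python sketch of that logic (the process table here is a stand-in, not read from /proc):

```python
# Hedged sketch of the process-count check behind the
# "PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka ..."
# alerts. procs is a list of (command_name, full_cmdline) tuples.

def check_kafka_broker(procs, name="java",
                       args="kafka.Kafka /etc/kafka/server.properties"):
    n = sum(1 for cmd, cmdline in procs if cmd == name and args in cmdline)
    state = "OK" if n >= 1 else "CRITICAL"
    plural = "" if n == 1 else "es"
    return f"PROCS {state}: {n} process{plural} with command name {name}, args {args}"

running = [("java", "java -Xmx4g kafka.Kafka /etc/kafka/server.properties")]
print(check_kafka_broker(running))  # broker up   -> PROCS OK: 1 process ...
print(check_kafka_broker([]))       # broker down -> PROCS CRITICAL: 0 processes ...
```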
[14:01:15] shhh [14:01:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:01:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:19] RECOVERY - Varnishkafka Delivery Errors on cp4002 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:20] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:29] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:29] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:29] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [14:03:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:29] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:29] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5252.0 [14:03:39] RECOVERY - Varnishkafka Delivery Errors on cp4003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:39] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 10.7 [14:03:49] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:03:49] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:04:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 
[14:04:29] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:04:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:04:39] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:04:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [14:04:59] PROBLEM - Kafka Broker Messages In on analytics1022 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [14:05:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:06:49] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:07:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 9.433333 [14:07:19] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 21.466667 [14:07:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 27.033333 [14:07:29] PROBLEM - Varnishkafka Delivery Errors on cp1059 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 25.1 [14:07:29] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 8.4 [14:08:29] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:09:49] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:10:09] PROBLEM - check google safe browsing for wiktionary.org on google is CRITICAL: Connection timed out [14:10:28] !log upping nofile open file limit on
analytics1021 and analytics1022, rebooting [14:10:36] Logged the message, Master [14:10:49] PROBLEM - check google safe browsing for wikibooks.org on google is CRITICAL: Connection timed out [14:10:59] PROBLEM - check google safe browsing for wikiquotes.org on google is CRITICAL: Connection timed out [14:11:09] PROBLEM - check google safe browsing for wikinews.org on google is CRITICAL: Connection timed out [14:11:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:11:19] PROBLEM - check google safe browsing for wikipedia.org on google is CRITICAL: Connection timed out [14:11:19] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:11:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:11:49] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:49] PROBLEM - Host analytics1022 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:49] RECOVERY - check google safe browsing for wikiquotes.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3838 bytes in 0.093 second response time [14:12:09] RECOVERY - check google safe browsing for wikipedia.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3920 bytes in 0.088 second response time [14:12:29] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:12:59] RECOVERY - check google safe browsing for wikinews.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3836 bytes in 0.087 second response time [14:13:49] PROBLEM - check google safe browsing for mediawiki.org on google is CRITICAL: Connection timed out [14:13:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [14:13:59] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:14:59] PROBLEM 
- check google safe browsing for wikiversity.org on google is CRITICAL: Connection timed out [14:15:09] PROBLEM - check google safe browsing for wikisource.org on google is CRITICAL: Connection timed out [14:16:59] RECOVERY - Kafka Broker Messages In on analytics1022 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 35448.4 [14:17:39] RECOVERY - check google safe browsing for mediawiki.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3840 bytes in 0.087 second response time [14:18:09] PROBLEM - check google safe browsing for wikinews.org on google is CRITICAL: Connection timed out [14:18:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 943.400024 [14:18:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [14:18:59] RECOVERY - check google safe browsing for wikisource.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3845 bytes in 0.090 second response time [14:19:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:19:19] PROBLEM - check google safe browsing for wikimedia.org on google is CRITICAL: Connection timed out [14:19:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [14:20:09] RECOVERY - check google safe browsing for wikimedia.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 4150 bytes in 0.102 second response time [14:20:59] RECOVERY - check google safe browsing for wikinews.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3836 bytes in 0.086 second response time [14:20:59] RECOVERY - check google safe browsing for wiktionary.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3923 bytes in 0.116 second response time [14:21:49] RECOVERY - check google safe browsing for wikibooks.org on google is OK: HTTP 
OK: HTTP/1.1 200 OK - 3841 bytes in 0.097 second response time [14:21:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.066667 [14:21:49] RECOVERY - check google safe browsing for wikiversity.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3848 bytes in 0.087 second response time [14:21:49] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:22:49] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [14:23:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:26:19] PROBLEM - Host analytics1022 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:09] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:27:49] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [14:27:52] heya cmjohnson1, you there? 
[14:28:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 844.966675 [14:28:19] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 840.0 [14:28:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 853.633362 [14:28:29] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.625 [14:28:29] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.595238 [14:28:29] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.666667 [14:28:29] PROBLEM - Varnishkafka Delivery Errors on cp1059 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 835.43335 [14:28:39] PROBLEM - Varnishkafka Delivery Errors on cp4003 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6 [14:28:49] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.606061 [14:28:49] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 969.400024 [14:28:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 964.133362 [14:29:19] PROBLEM - Varnishkafka Delivery Errors on cp4002 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.611111 [14:29:29] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:29:29] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.6 [14:29:49] PROBLEM - Varnishkafka Delivery Errors on cp3014 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.016667 [14:30:29] 
RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:30:49] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:31:19] RECOVERY - Varnishkafka Delivery Errors on cp4002 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:31:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:31:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:31:29] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:31:29] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:31:29] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:31:39] RECOVERY - Varnishkafka Delivery Errors on cp4003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:32:19] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:33:49] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:34:49] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:35:19] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 38.033333 [14:35:19] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.8 [14:35:29] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 50.966667 [14:36:19] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:36:19] RECOVERY - Varnishkafka Delivery Errors 
on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:36:29] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:36:49] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:43:22] https://rt.wikimedia.org/Ticket/Display.html?id=6740 [14:44:03] ok analytics1022 is back up and handling data [14:44:06] analytics1021 won't boot [14:45:02] !log analytics1022 back up with higher nofile ulimit, now handling all kafka traffic, analytics1021 won't boot [14:45:11] Logged the message, Master [14:46:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 16.799999 [14:48:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:49:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 887.766663 [14:51:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 50.400002 [14:53:58] ottomata: hey [14:55:03] hey, got your pm, thanks!
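The fix logged above raised the nofile (open file descriptor) ulimit on the Kafka brokers. On the servers this would be done via limits.conf or the broker init script; a hedged programmatic sketch of what raising RLIMIT_NOFILE looks like (the 65536 target is an assumption, not the value used on analytics1021/1022):

```python
# Hedged sketch: raise the soft RLIMIT_NOFILE toward the hard limit.
# An unprivileged process can raise its soft limit up to, but not
# past, the hard limit. Target value 65536 is illustrative.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 65536 if hard == resource.RLIM_INFINITY else min(hard, 65536)
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))

# The soft limit never decreases here.
print(resource.getrlimit(resource.RLIMIT_NOFILE)[0] >= soft)  # True
```

A broker that leaks or simply holds many sockets and log-segment file handles hits the default 1024-descriptor soft limit quickly, which matches the crash-and-restart pattern in the alerts above.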
[14:55:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 198.800003 [14:56:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:00:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:12:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 45.266666 [15:14:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:15:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:16:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [15:18:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 16.766666 [15:18:49] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 47.400002 [15:21:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:24:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 59.733334 [15:27:53] (03CR) 10Chad: [C: 031] Fix new Elasticsearch monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/110464 (owner: 10Manybubbles) [15:28:04] <^d> ottomata: You think we could get ^ out today? 
[15:28:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 430.866669 [15:28:59] sho [15:29:02] (03PS2) 10Manybubbles: Fix new Elasticsearch monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/110464 [15:29:08] (03CR) 10Ottomata: [C: 032 V: 032] Fix new Elasticsearch monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/110464 (owner: 10Manybubbles) [15:29:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:32:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 33.200001 [15:33:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:36:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 65.0 [15:40:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:43:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 27.6 [15:44:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:47:29] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 32.166668 [15:50:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 100.033333 [15:52:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:00:29] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:01:16] !log stopping puppet on cp3019 to 
experiement with varnishkafka buffer levels [16:01:23] Logged the message, Master [16:04:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 212.53334 [16:08:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:11:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 137.199997 [16:12:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:13:12] hashar: https://gerrit.wikimedia.org/r/#/c/109690/ should fix the invalid argument for foreach() in Swift errors. The patch should have made it into wmf12. [16:13:24] <^d> That's...annoying :\ [16:15:45] The "recursion detected" crap is https://bugzilla.wikimedia.org/show_bug.cgi?id=54193 and needs love from someone who has a strong grasp of Setup.php and the early User lifecycle [16:16:05] <^d> Yeah, we talked about that monday. 
[16:16:13] * bd808 nods [16:17:37] * hashar gives up on trying to handle a debian package [16:17:49] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 16.799999 [16:18:00] bd808: I wish we could catch such issues during testing :(- [16:18:04] Invalid parameter for message "parentheses" is https://bugzilla.wikimedia.org/show_bug.cgi?id=59124 is probably a small bug deep in Linker [16:18:40] gwicke: ping [16:19:29] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 103.866669 [16:20:17] <^d> bd808: That's quite possibly the worst hook ever :\ [16:20:29] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:20:49] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:21:44] (03PS1) 10Ottomata: Upping varnishkafka produce buffer to keep up with traffic on bits servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/110583 [16:22:00] (03CR) 10Ottomata: [C: 032 V: 032] Upping varnishkafka produce buffer to keep up with traffic on bits servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/110583 (owner: 10Ottomata) [16:22:28] ^d: The preg_replace_callback stuff in there is icky. I think it could be made a lot cleaner with some closures instead of object state hacks. [16:22:41] !log reenabling puppet on cp3019 [16:22:48] Logged the message, Master [16:25:19] <^d> bd808: preg? I must be looking at a different file...? [16:26:28] <^d> Oh duh, I was still talking about UserLoadFromSession [16:26:30] <^d> You had moved on :p [16:26:40] ^d: I was thiking of Linker::formatAutocomments and ::formatLinksInComment [16:26:52] <^d> Yeah, I'm caught up now. [16:27:08] UserLoadFromSession made my head hurt [16:27:11] <^d> UserLoadFromSession is such a rabbit hole. [16:27:15] <^d> It's obvious how it's broken. 
[16:27:48] <^d> But the solution isn't so obvious....beyond yanking the hook out [16:28:08] Hard code all the things! [16:31:19] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: -889369.1875 [16:31:49] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:37:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Inline comments" (035 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107492 (owner: 10GWicke) [16:42:29] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:39] bd808: https://www.mediawiki.org/wiki/Requests_for_comment/Linker_refactor :) [16:42:57] needs more polishing [16:43:15] ottomata: is analytics1003 you ? [16:43:19] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:43:24] yes [16:43:25] sorry [16:43:28] it me [16:43:31] i am checking ulimits there [16:43:33] ok :-) [16:43:35] an03 and an04 are test brokers anyway [16:43:38] not in production [16:43:40] should have logged, sorry [16:43:46] !log rebooting an03 to test kafka ulmits [16:43:53] Logged the message, Master [16:44:19] aude: s/more polishing/to be implemented/ [16:45:49] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [16:52:18] bd808: yep [16:52:55] * aude thinks autocomments code was written by erik ~10 years ago [16:53:10] probably state of the art then, but ... 
:) [16:53:53] Time rots most code [16:54:37] yep [16:55:58] !log rebooting analytics1003 [16:56:07] Logged the message, Master [16:57:31] (03PS4) 10GWicke: WIP: Phase 1 of moving to new parsoid repo / upstart [operations/puppet] - 10https://gerrit.wikimedia.org/r/107492 [16:58:09] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:47] (03CR) 10GWicke: WIP: Phase 1 of moving to new parsoid repo / upstart (035 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107492 (owner: 10GWicke) [17:01:41] akosiaris: ping [17:02:49] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:05:53] gwicke: pong [17:06:30] akosiaris, I amended the patch & replied to your comments [17:07:09] (03PS1) 10Ottomata: Adding ability to set ulimit nofiles, increased to 8192 by default for kafka server [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/110590 [17:07:10] the idea in general with this changeset is to get things ready for testing [17:08:07] when the repo is initialized, pushed out with git-deploy & upstart manually verified to be working, the follow-up changeset can then actually activate the codebase & upstart [17:08:36] (03PS2) 10Ottomata: Adding ability to set ulimit nofiles, increased to 8192 by default for kafka server [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/110590 [17:08:57] akosiaris: ^ [17:11:16] gwicke: ok looking at your answers now. Sorry it is taking long, crappy hotel connection [17:11:51] akosiaris, np; thanks for tackling this! [17:13:14] gwicke: what i meant with service_name => parsoid-test is that IIRC git-deploy will try to restart the service, so if you specify parsoid there, it will restart the old service. I guess that is not what you want, right ? [17:14:27] heh... i should have paid more attention the the checkout_submodules thing. Now it makes more sense. 
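The kafka ulimit patch above raises the broker's `nofile` limit; in a sysvinit wrapper that is typically a `ulimit -n` call made before launching the daemon. A sketch of how such a wrapper might read and conditionally raise the soft limit (the 8192 target matches the patch; everything else is a generic sketch, since an unprivileged process can only raise its soft limit up to the hard limit):

```shell
# Show the current soft/hard open-file limits, as an init script would
# before launching a file-handle-hungry daemon like a Kafka broker.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "nofile soft=$soft hard=$hard"

# Raise the soft limit toward a target of 8192, but never past the
# hard limit; exceeding it would make the ulimit call fail.
target=8192
if [ "$hard" != "unlimited" ] && [ "$target" -gt "$hard" ]; then
    target=$hard
fi
if [ "$soft" != "unlimited" ] && [ "$target" -gt "$soft" ]; then
    ulimit -Sn "$target"
fi
echo "effective nofile: $(ulimit -Sn)"
```

Rebooting an03 (as logged above) is the heavyweight way to verify the limit sticks for the service; `cat /proc/<pid>/limits` on the running broker checks the same thing in place.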
[17:14:29] restarts were disabled in the old service [17:14:57] we didn't touch that, so I was assuming that it still won't restart [17:15:20] there is a separate salt command for the restart [17:15:37] hmmm let me check [17:18:08] (03PS3) 10Ottomata: Adding ability to set ulimit nofiles, increased to 8192 by default for kafka server [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/110590 [17:19:26] gwicke: which nodejs package were you referring to ? the nodejs on wtp10* (ver: 0.8.2-1chl1~precise1) does all the alternatives stuff I was talking about. I assume you refer to sid/jessie 0.10.2X ? [17:19:53] we are switching to 0.10 in the next weeks [17:20:03] yeah well aware [17:20:13] which is why i mentioned it [17:20:18] nodejs is the binary at least on Debian, and works on Ubuntu as well [17:20:19] RECOVERY - Varnishkafka Delivery Errors on cp3013 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [17:20:59] both in 0.8 and 0.10 [17:21:52] dpkg -S /usr/bin/node [17:21:53] nodejs-legacy: /usr/bin/node [17:22:00] heh [17:22:02] try [17:22:05] /usr/bin/js [17:22:28] anyway... not really arguing there... [17:22:56] not installed here [17:22:59] i don't know why they changed it, but i don't like it, so I am inclined to say let's stick to the binary [17:24:02] $ls -l /usr/bin/js [17:24:02] lrwxrwxrwx 1 root root 20 Jan 31 18:21 /usr/bin/js -> /etc/alternatives/js [17:24:02] $ ls -l /etc/alternatives/js [17:24:02] lrwxrwxrwx 1 root root 15 Jan 31 18:21 /etc/alternatives/js -> /usr/bin/nodejs [17:24:07] weird... [17:24:18] that's an Ubuntuism afaik [17:24:23] nope [17:24:27] 0.10.21~dfsg1 [17:24:31] pure debian [17:24:36] odd [17:24:49] anyway this is going far and is counter-productive ... 
[17:25:04] feel free to explore more but we go with your suggestion [17:25:07] I'm on 0.10.25, also Debian, and don't have any /usr/bin/js [17:25:12] now i am convinced :-) [17:26:00] ok ;) [17:31:10] (03CR) 10Ottomata: "Depends on this being deployed:" [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/110592 (owner: 10Ottomata) [17:38:15] gwicke: so the service name is used in salt -G 'cluster:something something' deploy.restart 'parsoid' [17:38:58] i am getting more confident that you don't want the same string for 2 different repos there [17:39:17] even if it is just temporary [17:39:40] with restarting disabled it should not matter afaik [17:43:54] what if the salt command gets called manually ? [17:43:59] my understanding is that restarting is controlled by the grain [17:44:17] which is IIRC where Ryan disabled automatic restarts [17:44:40] salt -b 10% -G 'deployment_target:parsoid' parsoid.restart_parsoid parsoid [17:44:45] (from https://wikitech.wikimedia.org/wiki/Parsoid) [17:47:42] (03CR) 10Ottomata: webserver: lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 (owner: 10Matanya) [17:53:06] akosiaris, that salt command calls restart_parsoid in modules/deployment/files/modules/parsoid.py [17:53:16] which does a ret = subprocess.call("/etc/init.d/parsoid restart", shell=True) [17:53:18] yeah i read the code already [17:53:37] so the service name should not matter [17:53:48] yes... mostly due to the bad call [17:55:22] ok then... so we can merge. [17:55:37] just let me put the comment there and we are done [17:56:02] ok, cool [17:57:05] I'm looking into the follow-up patch now [17:58:10] I think the main things that need tweaking is to rename the upstart config to parsoid.conf (so that upstart takes precedence) and changing (removing?) the parsoid_restart method in modules/deployment/files/modules/parsoid.py [17:59:20] jeff_green: can you review this please. 
https://gerrit.wikimedia.org/r/#/c/109655/ [18:00:07] (03PS5) 10Alexandros Kosiaris: WIP: Phase 1 of moving to new parsoid repo / upstart [operations/puppet] - 10https://gerrit.wikimedia.org/r/107492 (owner: 10GWicke) [18:00:21] gwicke: yeah, still a couple of stages before that. I suppose stopping old parsoid, starting new parsoid ? [18:00:55] akosiaris, ideally only manually on one depooled box [18:01:36] although keeping it pooled should be fine too [18:03:21] gwicke, so, we are not doing a real deploy today, are we? this is just a test, I assume. [18:03:54] yes, this is a test so that we can deploy on Monday [18:04:13] ok. [18:05:07] (03CR) 10Alexandros Kosiaris: [C: 032] WIP: Phase 1 of moving to new parsoid repo / upstart [operations/puppet] - 10https://gerrit.wikimedia.org/r/107492 (owner: 10GWicke) [18:05:45] meh... i should have had removed the WIP from the subject... [18:06:51] it now will be WIP forever [18:07:42] ;) [18:08:08] akosiaris, the deploy repo has code very slightly older than production (there were some fixes after the repo split) [18:08:21] so ideally we'd depool a machine for testing [18:09:46] ok choosing wtp1015 [18:09:58] if i ever manage to login... 
crappy hotel connections [18:12:43] the code is actually the same as the deployed one [18:13:02] just the hashes don't match as we had to transplant some fixes manually [18:13:08] (03PS3) 10Ori.livneh: kibana: Set default dashboard [operations/puppet] - 10https://gerrit.wikimedia.org/r/110457 (owner: 10BryanDavis) [18:14:40] (03PS4) 10BryanDavis: kibana: Set default dashboard [operations/puppet] - 10https://gerrit.wikimedia.org/r/110457 [18:14:45] (03CR) 10Ori.livneh: [C: 032 V: 032] kibana: Set default dashboard [operations/puppet] - 10https://gerrit.wikimedia.org/r/110457 (owner: 10BryanDavis) [18:17:19] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [18:17:31] gwicke: seems like trebuchet has not cloned the repo [18:17:38] something is not right [18:18:01] ouch, sorry to see the news about analytics1021 won't boot [18:18:15] i've had trebuchet fail silently if there are local modifications [18:18:19] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [18:18:22] unfortunately i don't have to look into it right now. Mind if we leave it for now ? 
I will look into it monday morning [18:18:26] or rather not fail silently but lie and say everything went ok [18:18:34] ori: niiice [18:18:57] akosiaris: pm, btw (unrelated) [18:19:45] akosiaris, ok [18:20:36] akosiaris, let me check if the repo was created on tin [18:20:59] no deploy repo yet [18:22:24] i'll run salt-call deploy.deployment_server_init [18:22:35] per [18:22:51] great that it's documented, not great that this is sufficiently regular to warrant documentation [18:23:31] gwicke: [18:23:34] # salt-call deploy.deployment_server_init [18:23:34] [INFO ] Executing command '/usr/bin/git clone https://gerrit.wikimedia.org/r/p/mediawiki/services/parsoid/deploy/.git /srv/deployment/parsoid/deploy' as user 'sartoris' in directory '/nonexistent' [18:23:34] [INFO ] Executing command 'git config deploy.tag-prefix parsoid/deploy' as user 'sartoris' in directory '/srv/deployment/parsoid/deploy' [18:24:22] that did it [18:24:55] the submodule is still empty though [18:25:22] you can't have everything [18:25:22] normally trebuchet is supposed to handle that [18:28:25] i see some kind of checkout_submodules config var [18:28:31] (03PS1) 10Cmjohnson: removing professor from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/110600 [18:28:32] yup [18:28:33] is it set for parsoid? [18:28:42] Ryan_Lane is here, let's jump him [18:29:01] 'checkout_submodules' => true, [18:29:03] ? [18:29:20] fetching and checking out submodules is an optional thing, yes [18:29:31] deployment.pp line 97 [18:29:42] what's wrong with it? [18:29:55] something broken? 
[18:30:00] at least on tin the submodule is not checked out [18:30:02] (03PS1) 10Cmjohnson: Adding loudon to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110601 [18:30:10] we are just starting to test the submodule-based deploy system [18:30:23] the repo was also not created automatically [18:30:29] oh, you may need to checkout submodules there manually [18:30:39] there's an FAQ for that [18:30:41] (03CR) 10jenkins-bot: [V: 04-1] Adding loudon to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110601 (owner: 10Cmjohnson) [18:31:16] Ryan_Lane, is this any indication on whether it will work on the cluster nodes? [18:31:24] it definitely works [18:31:32] it's being used already [18:31:39] other repos use it [18:31:54] the function that clones onto tin doesn't do a submodule checkout [18:31:59] that's actually just an oversight [18:32:17] also, the automation that triggers clones on tin just happens to not be triggering [18:32:35] on the nodes the repo does not seem to be checked out either: ls /srv/deployment/parsoid/deploy [18:32:43] No such file or directory [18:32:55] did you manually clone it on tin? [18:33:05] ori ran the salt command [18:33:10] i ran salt-call deploy.deployment_server_init [18:33:19] right, so that puts it on tin [18:33:24] hiii paravoid [18:33:28] it won't be on the targets until you do a deployment [18:33:33] Ryan_Lane: sans submodules [18:33:34] oh you have a red dot for away next to your name, rats! [18:33:39] you'll need to checkout the submodules yourself on tin [18:33:53] ottomata: it actually means he's a communist [18:34:08] let me open an issue in github for deployment_server_init not handling submodules [18:34:13] ok, so in theory doing a git deploy from tin should result in repos with submodules on the cluster nodes? 
[18:34:18] and not triggering on tin [18:34:28] yep [18:34:29] that too [18:34:42] gwicke: it sounds like you need to init the submodules by hand on tin first [18:34:43] gwicke: you need to do: git submodule update --init [18:34:46] on tin [18:35:14] creation of new repos is still not bulletproof in the system [18:35:31] no update required, only init? [18:35:40] gwicke: you need to do: git submodule update --init [18:35:55] ah, k [18:36:01] I did the two separately [18:36:12] ok, let me try a deploy now [18:36:15] Ryan_Lane: also, Salt::Grain[deployment_global_hook_dir], Salt::Grain[deployment_repo_user], Salt::Grain[deployment_server] run on every puppet run since they don't have good conditions, causing log churn [18:36:27] Ryan_Lane: and Salt::Grain[deployment_repo_user] is still 'sartoris' [18:36:38] yeah, I have a gerrit change in for that [18:36:46] can you be around for a moment in case stuff goes wrong? [18:36:47] ori: https://gerrit.wikimedia.org/r/#/c/110239/ [18:36:50] gwicke: yes [18:37:04] ok, going for it now [18:37:34] if this is a brand new target there's a possibility the first deploy will fail [18:37:50] using --force [18:37:53] https://wikitech.wikimedia.org/wiki/Trebuchet#Initial_fetches_are_failing_.28minions_forever_pending.29 [18:37:55] (03PS1) 10Cmjohnson: removing dns entries for erzrurumi and loudon: [operations/dns] - 10https://gerrit.wikimedia.org/r/110605 [18:37:55] as there are no changes [18:37:59] yeah, you have to use --force the first time [18:38:09] parsoid/deploy: repo not found in redis [18:38:11] (03CR) 10Diederik: "Could the redirect from https://bug-attachment.wikimedia.org/ to http://bug-attachment.wikimedia.org/ also be fixed by this patchset? (see" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/106109 (owner: 10Jeremyb) [18:38:16] that's normal [18:38:16] Continue?
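Bootstrapping a brand-new Trebuchet repo on the deploy host, as pieced together above, comes down to two manual steps: run the init that the automation failed to trigger, then fetch the submodules that `deployment_server_init` skips. A sketch of the sequence using the repo name from the log, guarded so it is a no-op anywhere other than a deployment server with salt installed:

```shell
if command -v salt-call >/dev/null 2>&1 && [ -d /srv/deployment ]; then
    # Clone the repo onto the deployment server (tin); this is the
    # step that was not triggering automatically.
    salt-call deploy.deployment_server_init

    # deployment_server_init does not fetch submodules (the oversight
    # noted above), so initialize them by hand.
    cd /srv/deployment/parsoid/deploy && git submodule update --init
else
    echo "salt-call or /srv/deployment not available; sequence shown for reference"
fi
```

The first `git deploy` against a brand-new target then still needs `--force`, since there are no prior changes to diff against.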
[18:38:22] well, kind of [18:38:32] I'd check again [18:38:36] just hit enter [18:38:42] if the same problem occurs, the fetch isn't happening [18:38:48] see the link I just posted [18:38:52] https://wikitech.wikimedia.org/wiki/Trebuchet#Initial_fetches_are_failing_.28minions_forever_pending.29 [18:39:00] ori will need to run that [18:39:03] nothing changes with enter [18:39:18] once he runs that, make it retry the fetch (with r) [18:39:43] I need to open a bug with salt for that. as a workaround I'm going to make grain additions restart the salt minion [18:39:56] * Ryan_Lane adds another issue in github [18:40:19] * ori does [18:40:40] what's the appropriate node regex? [18:40:42] parsoid* ? [18:40:55] what's the grain for this repo? [18:41:12] parsoid [18:41:31] didn't that grain already exist on the systems? [18:41:43] yes [18:41:56] ori: salt -G 'deployment_target:parsoid' test.ping [18:41:58] i'm a bit confused, so i'll wait for one of you to construct the exact command i should run [18:42:01] ah, there we go [18:42:10] I just want to see if they return ping [18:42:13] if so there's another problem [18:42:39] yeah, a bunch of wtp* machines return True [18:42:55] this is on the wtp systems, right? [18:43:01] ori: salt -G 'deployment_target:parsoid' deploy.fetch '" [18:43:52] you could also target just a single node [18:43:52] what's the repo name? parsoid/deploy? [18:44:09] if that's what you called it :) [18:44:27] well, parsoid/deploy: repo not found in redis [18:44:29] so i imagine so [18:44:34] yeah [18:44:46] but the minions reply with an exception traceback for KeyError: 'parsoid/deploy' [18:44:54] ah [18:45:12] salt -G 'deployment_target:parsoid' saltutil.refresh_pillar [18:45:20] or is it refresh_pillars? 
[18:45:22] one of the two [18:45:22] (03CR) 10Cmjohnson: [C: 032] removing professor from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/110600 (owner: 10Cmjohnson) [18:45:47] singular pillar [18:45:48] this is related to the trigger not occuring on tin, too [18:45:56] * gwicke watches the Ryan Deploy System Log [18:46:04] something on the puppet master isn't triggering salt refreshes [18:46:09] okay, I got a bunch of Nones, going to retry the deploy.fetch [18:46:15] (03PS2) 10Cmjohnson: Adding loudon to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110601 [18:46:33] still KeyError: 'parsoid/deploy' [18:46:53] salt -G 'deployment_target:parsoid' service.restart salt-minion [18:47:01] there's some stupid caching error [18:47:09] thank god for deployment automation [18:47:12] we need to upgrade salt. I believe this is fixed in a newer version [18:47:22] meh, this only happens on brand new repos [18:47:31] and no one other than me bothers to work on this [18:47:35] try to deploy this with scap [18:47:42] i can deploy this with scp [18:47:49] haha [18:47:53] yeah, and you'd need to write code to do it [18:48:00] (03CR) 10Cmjohnson: [C: 032] Adding loudon to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110601 (owner: 10Cmjohnson) [18:48:09] ok, i ran that, and then the fetch, and this time it appears to have worked [18:48:16] where does that put us? [18:48:25] gwicke: hit retry [18:48:34] * gwicke hits 'r [18:48:35] ' [18:48:47] no error this time [18:48:49] (03CR) 10Cmjohnson: [C: 032] removing dns entries for erzrurumi and loudon: [operations/dns] - 10https://gerrit.wikimedia.org/r/110605 (owner: 10Cmjohnson) [18:48:54] good, continue [18:49:19] 24 minions pending [18:49:31] no progress it seems [18:49:33] is the checkout state bad? [18:49:55] * gwicke shrugs [18:50:02] pastebin the errors? [18:50:14] or the output, that is [18:50:17] there are no errors, just 24/24 minions pending [18:50:37] ugh. 
the reporting must have an issue. I really need to switch to the new frontend [18:50:47] wtp1004.eqiad.wmnet: None (fetch: 1 [started: 0 mins, last-return: 0 mins]) [18:50:51] ah [18:50:57] 24 lines like that with 'd' [18:51:24] it looks like the fetch failed [18:51:52] ori: what was the output from one of the minions? [18:52:04] when you ran the fetch [18:52:09] wtp1006.eqiad.wmnet: [18:52:09] ---------- [18:52:09] dependencies: [18:52:11] repo: [18:52:13] parsoid/deploy [18:52:15] status: [18:52:17] 1 [18:52:25] heh. status 1 isn't 0 ;) [18:52:33] (03PS1) 10Hoo man: Don't allow less than 1 thread in mw-update-l10n [operations/puppet] - 10https://gerrit.wikimedia.org/r/110609 [18:53:02] so, a deeper level of debugging: go to wtp1006 and run: salt-call deploy.fetch 'parsoid/deploy' [18:53:09] * ori goes deeper [18:53:13] low-hanging CR ^ [18:53:29] it should output some logging into [18:53:40] it should also put some info into /var/log/salt/minion..og [18:53:42] *log [18:53:42] * gwicke thinks of Jamiroquai [18:53:53] we should really have the salt logs go into logstash [18:53:59] http://p.defau.lt/?MTQbfh_PZQaHaEC1IW6ZtQ [18:54:01] then we could just look at logstash [18:54:26] are the submodules properly initialized on the deployment system? [18:54:56] Ryan_Lane, on tin I did git submodule init followed by git submodule update [18:55:05] which fetched the submodule as expected [18:55:10] both before the deploy [18:55:11] are there any recursive submodules? [18:55:16] no [18:55:27] didn't parsoid have submodules? [18:55:39] Ryan_Lane: Salt logs to logstash sounds like a good idea. Where can we pick the salt logs up from and will they possibly have data in them that will freak Faidon out if it is visible to all wmf ldap users? [18:55:40] the old repo had [18:55:44] the new one doesn't [18:55:55] what's /srv/deployment/parsoid/deploy/node_modules/express/node_modules/qs ? 
[18:56:02] * gwicke reminisces about http://www.youtube.com/watch?v=WIUAC03YMlA [18:57:12] root@wtp1006:~# ls /srv/deployment/parsoid/deploy/node_modules/express/node_modules/qs [18:57:12] benchmark.js examples.js History.md index.js lib Makefile package.json Readme.md test [18:57:18] Ryan_Lane, some of the npm modules are fetched with git, but they are not submodules in the deploy repo [18:57:34] they are just checked in regularly [18:58:02] ah [18:58:03] .gitmodules [18:58:07] the only submodule is src, which is the code [18:58:10] there's a .gitmodules in that directory [18:58:37] does that matter? [18:58:40] yes [18:58:44] it did not seem to do anything for me [18:58:55] git seems to ignore it [18:59:00] the deployment system does some magic with that [18:59:02] it's a bug [18:59:05] in this case [18:59:11] but that's what's causing the problem [18:59:14] * Ryan_Lane opens an issue [19:00:16] you should remove those from your repo [19:00:24] but I'll fix the bug either way [19:00:59] i have to head out in a couple. what's the plan? should i be doing anything else? [19:01:04] ./node_modules/express/node_modules/qs/.gitmodules [19:01:04] ./node_modules/html5/node_modules/jsdom/node_modules/cssom/.gitmodules [19:01:04] ./node_modules/request/node_modules/qs/.gitmodules [19:01:10] those need to be deleted [19:01:19] Ryan_Lane, we simply check in the node_modules unchanged after a 'npm install' run [19:01:20] from the repo [19:01:33] which adds a .gitmodules file [19:01:49] like I said, I'll fix the bug, but this is what you need to do in the meantime [19:02:00] removing .gitmodules manually every time does not sound like the best way forward [19:02:04] ..... [19:02:09] ok [19:02:13] I just said I'll fix the bug, didn't I? [19:02:52] you could also add them to your gitignore for anything other than the parent module [19:02:56] so should I abort the current deploy? 
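The fix above is to drop the stray `.gitmodules` files that `npm install` checked in under `node_modules`, while keeping the repo's own top-level `.gitmodules` (which declares the real `src` submodule). A sketch of that cleanup in a throwaway directory mimicking the layout:

```shell
# Build a miniature copy of the problematic layout: a real top-level
# .gitmodules plus stray ones vendored in under node_modules.
tmp=$(mktemp -d)
mkdir -p "$tmp/node_modules/express/node_modules/qs" \
         "$tmp/node_modules/request/node_modules/qs"
printf '[submodule "src"]\n' > "$tmp/.gitmodules"
touch "$tmp/node_modules/express/node_modules/qs/.gitmodules" \
      "$tmp/node_modules/request/node_modules/qs/.gitmodules"

# Delete only the vendored copies; the top-level file survives because
# the search is rooted at node_modules/.
find "$tmp/node_modules" -name '.gitmodules' -delete

find "$tmp" -name '.gitmodules'
rm -rf "$tmp"
```

In the real repo the removals went in as a Gerrit change (110612/110613, mentioned below in the log), while the ad-hoc `find … -delete` over the working copies on the wtp* hosts was the stopgap to unblock the fetch.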
[19:03:10] yes, then remove the files and start again [19:05:23] it really wouldn't hurt for someone other than me to learn the system some ;) [19:06:22] I know ottomata is also looking at it now for jvm deployment, but even just for reporting bugs, going through the troubleshooting steps would be good for some others [19:06:37] https://wikitech.wikimedia.org/wiki/Trebuchet#Troubleshooting_the_deployment_from_multiple_locations [19:06:43] I do have stuff documented [19:07:58] should i run salt 'wtp*' cmd.run "find /srv/deployment/parsoid/deploy/node_modules -name '.gitmodules'" ? [19:08:05] er, -delete at the end there [19:08:11] that won't help [19:08:18] they need to be removed from the repo [19:08:19] Ryan_Lane: I think learning about trebuchet is on my plate for this quarter. [19:08:33] bd808: talk with me as much as you need. I'm here to help [19:09:07] Ryan_Lane: Will do. I'm starting my journey with trying to understand what scap does and why [19:09:26] heh [19:09:30] that's how I started too [19:09:39] I gave a keynote at saltconf about this yesterday [19:09:51] I can give you a quick overview [19:10:34] Ryan_Lane: That would be most appreciated. Should we schedule a hangout or phone call next week? [19:10:41] yes [19:11:05] some time relatively early in the morning, or later in the evening PST time is best for me [19:12:53] gwicke: https://gerrit.wikimedia.org/r/#/c/110612/ ? [19:12:54] Ryan_Lane: 09:00 PST Tuesday? 
Or earlier if you'd like [19:14:24] works for me [19:14:24] ori, https://gerrit.wikimedia.org/r/#/c/110613/ [19:14:34] we worked on the same thing [19:14:35] (03PS1) 10Cmjohnson: removing professor as an example from statsd/man/init.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110614 [19:14:46] ah [19:15:36] you version looks better though [19:15:41] I'll try to fix this issue this week so that this won't be necessary [19:15:48] oh [19:15:50] heh [19:15:52] i just abandoned it [19:15:58] you can restore and +2 if you like [19:17:27] so now we just need to deploy this and deploying will work [19:17:29] I'm kind of wondering if removing the .gitmodules files breaks things for third parties that want to update node_modules [19:17:52] can figure that out later though [19:18:07] +2ed the patch [19:18:22] we'll remove that change later [19:18:31] when the deployment system can handle it [19:18:45] k [19:18:47] that said, this wouldn't break third parties anyway [19:18:47] Ryan_Lane: gcal invitation sent. I invited ori and greg-g as optional too if they have time/inclination to listen in. [19:18:52] should i delete the .gitmodules files under node_modules manually? i presume we won't be able to deploy this change since we're blocked on the problem that it fixes [19:18:53] since the .git directory doesn't work [19:18:55] err [19:18:56] doesn't exist [19:19:13] ori: hm. I think it should work even without removing it [19:19:26] so what's the next step? [19:19:36] redeploy with this change and see if it works [19:19:47] if it doesn't I'd remove them via salt [19:20:00] gwicke, want to give it a shot? [19:20:16] (03CR) 10Cmjohnson: [C: 032] removing professor as an example from statsd/man/init.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110614 (owner: 10Cmjohnson) [19:20:43] trying.. 
[19:21:10] 24 pending wtp1004.eqiad.wmnet: None (fetch: 1 [started: 0 mins, last-return: 0 mins]) [19:21:30] bd808: can't, RelEng weekly check-in [19:21:37] quite apropos, though [19:21:51] i may actually prefer sshing to each host and typing the entire source tree by hand at this point [19:22:04] (03PS3) 10Dzahn: add FIXMEs for erzurumi references [operations/puppet] - 10https://gerrit.wikimedia.org/r/109655 [19:22:12] (03CR) 10Cmjohnson: [C: 032] add FIXMEs for erzurumi references [operations/puppet] - 10https://gerrit.wikimedia.org/r/109655 (owner: 10Dzahn) [19:22:27] I'd recommend using salt-call deploy.fetch 'parsoid/deploy' on a single node [19:22:44] then figure out how it's breaking [19:22:47] fix the problem on one [19:22:52] then fix it on the others via salt [19:22:56] then re-run the deploy [19:23:42] ok, i ran the find -delete on wtp* hosts [19:23:58] then i ran deploy.fetch on wtp1015, and it appears to have succeeded [19:24:03] * gwicke barely suppresses remarks about using debs [19:24:11] status: 0 and tag: parsoid/deploy-20140131-192036 [19:24:16] great [19:24:24] gwicke: do a retry [19:24:25] should gwicke retry? [19:24:45] hooray! [19:24:56] after 'r' all minions are listed as done [19:25:23] should all be done now [19:25:44] and I see a full checkout on wtp1015 [19:25:58] well, check another node [19:26:11] because wtp1015 is where i manually ran deploy.fetch [19:26:59] so, all the bugs should be tracked here: https://github.com/trebuchet-deploy/trebuchet/issues [19:27:08] if one is missing, please add it [19:27:10] verified checked out on all nodes [19:28:01] ori, can you try a /etc/init.d/parsoid stop on wtp1015, followed by a service parsoid-test start? [19:28:27] were the init scripts replaced by upstarts? 
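The sequence that finally unstuck the deploy above matches the wikitech "Initial fetches are failing" recipe: refresh pillar data on the targets, restart the minions to clear the stale cache, force a fetch, then answer the retry prompt on the deploy console. A guarded sketch using the grain and repo names from the log (a no-op where salt is not installed):

```shell
repo='parsoid/deploy'
grain='deployment_target:parsoid'

if command -v salt >/dev/null 2>&1; then
    # 1. Make sure the targets answer at all.
    salt -G "$grain" test.ping
    # 2. Push fresh pillar data; new repos show up here.
    salt -G "$grain" saltutil.refresh_pillar
    # 3. A stale cache can survive the pillar refresh (the KeyError
    #    seen above), so restart the minions.
    salt -G "$grain" service.restart salt-minion
    # 4. Re-run the fetch; status 0 in the minion return means success.
    salt -G "$grain" deploy.fetch "$repo"
    # 5. Back on the deploy console, hit 'r' to retry the fetch stage.
else
    echo "salt not installed; sequence shown for reference"
fi
```

For debugging a single host, `salt-call deploy.fetch 'parsoid/deploy'` run locally on that minion (as done on wtp1006 above) gives the full log output instead of the summarized return.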
[19:28:46] yes, but upstart is not used yet by default [19:29:05] root@wtp1015:~# /etc/init.d/parsoid stop [19:29:05] * Stopping parsoid [ OK ] [19:29:05] root@wtp1015:~# service parsoid-test start [19:29:07] parsoid-test start/running, process 32530 [19:29:17] anyway. I'm going to disappear from this channel, since I shouldn't be here during work hours anyway [19:29:25] * Ryan_Lane waves [19:29:27] it stopped [19:29:41] gwicke: /proc/self/fd/9: 9: chdir: can't cd to /var/lib/parsoid/deploy/src [19:29:44] ahh, a bug on our end [19:29:49] (in /var/log/upstart/parsoid-test.log) [19:30:06] master(32548) initializing ../../conf/wmf/localsettings.js workers [19:30:32] i'm not optimistic about trebuchet [19:30:34] it will work in master though [19:31:19] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [19:31:21] is the issue the symlink from /var/lib/parsoid/deploy ? [19:31:25] (03Abandoned) 10Cmjohnson: remove tmh1/2 from everything, decommed (rt #6222) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94138 (owner: 10ArielGlenn) [19:31:38] lrwxrwxrwx 1 root wikidev 30 Jan 31 18:11 /var/lib/parsoid/deploy -> /srv/deployment/parsoid/deploy [19:35:10] we had a symlink before as well, but not sure if we chdir'ed into it [19:37:04] i'm going to hack it locally for a moment to move the chdir outside of the script block (making it an upstart directive rather than a command) [19:37:14] according to the chdir docs it should be able to resolve symlinks [19:40:47] * jeremyb waves alantz [19:40:57] is julie trias here? or on IRC at all? 
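As noted above, `chdir()` does resolve symlinked path components; the failure mode in the upstart log is the final directory not existing underneath the link target. A quick sketch of both cases in a scratch directory whose names mirror the deploy/src layout:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/srv/deploy/src"
ln -s "$tmp/srv/deploy" "$tmp/var-deploy"

# Traversing the symlink works: we land inside the real src dir.
( cd "$tmp/var-deploy/src" && echo "cd through symlink ok: $(pwd -P)" )

# Remove src on the link target and the same cd now fails, which is
# the "can't cd to .../deploy/src" seen in the parsoid-test log.
rmdir "$tmp/srv/deploy/src"
( cd "$tmp/var-deploy/src" 2>/dev/null ) || echo "cd fails once src is gone"
rm -rf "$tmp"
```

So the symlink from /var/lib/parsoid/deploy was never the problem; the chdir only fails while the src checkout is missing, and the later messages pin the real bug on the relative config path.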
[19:41:07] i see no chip either
[19:42:19] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.002 second response time
[19:42:57] gwicke: that's puppet, i guess, restarting the init.d service
[19:43:08] yeah
[19:47:35] gwicke: it's ../../conf/wmf/localsettings.js being relative that gets you
[19:48:05] in theory that should work in either location
[19:48:18] at least if the chdir works
[19:48:39] root@wtp1015:/var/lib/parsoid/deploy/src# ls ../../conf/wmf/localsettings.js
[19:48:39] ls: cannot access ../../conf/wmf/localsettings.js: No such file or directory
[19:49:28] hmm..
[19:49:35] that is passed to the server in api/
[19:50:22] but I guess that ends up using the cwd instead of the current dir
[19:50:52] yeah
[19:51:15] so nothing to do with the symlink; it won't resolve in either location, because it's relative to api/
[19:51:50] you can use http://nodejs.org/docs/latest/api/globals.html#globals_dirname
[19:52:03] (03CR) 10Jeremyb: "Diederik said:" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/106109 (owner: 10Jeremyb)
[19:52:05] we use that for the default
[19:52:17] but not for the optional --config parameter
[19:52:22] which makes sense
[19:52:31] so let me tweak the path
[19:52:47] in that case for --config it should be relative to require.main.filename
[19:52:59] since that matches the user's expectations, i think
[19:53:30] no, actually.. it should be relative to cwd, you're right
[19:54:37] gwicke: i think it should be path.resolve(process.cwd(), config_arg)
[19:54:44] if config_arg is absolute it won't be modified
[19:54:51] the base path will only be used if config_arg is a relative path
[19:54:52] (03PS1) 10GWicke: WIP: Phase 2 of switching to the new Parsoid deploy system [operations/puppet] - 10https://gerrit.wikimedia.org/r/110617
[19:55:41] per
[19:56:57] (03PS1) 10GWicke: Fix relative path to config [operations/puppet] - 10https://gerrit.wikimedia.org/r/110618
[19:58:17] gwicke: wouldn't that be better?
[19:58:39] better than being relative to cwd?
[19:58:51] I don't see the difference
[19:59:03] it should be relative to cwd
[19:59:12] we don't need to manually resolve the path we are passing to require
[19:59:46] oh, right. so the behavior is correct, but you weren't expecting that
[20:00:12] yeah, we tend to run the server from the api dir
[20:00:16] (03CR) 10Ori.livneh: [C: 032] Fix relative path to config [operations/puppet] - 10https://gerrit.wikimedia.org/r/110618 (owner: 10GWicke)
[20:00:23] in which case ../../ is correct
[20:00:45] i'll force a puppet run on wtp*
[20:01:28] https://gerrit.wikimedia.org/r/110617 is the first draft for switching to the new setup by default
[20:01:42] once verified working
[20:04:30] !log stopping sysv parsoid service on wtp1015 to test I74ba5f649
[20:04:39] Logged the message, Master
[20:05:29] no dice
[20:07:05] ori, it is expected not to fully start
[20:07:09] but for a different reason
[20:07:17] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused
[20:07:37] so is it ok, or are you unsure?
[20:08:01] can you check the upstart log?
[20:08:22] if the only remaining issue is the server misinterpreting -c as the number of workers, then we are good
[20:08:53] we changed the -c flag from to since then
[20:09:13] and we are prepping for a deploy of the new code
[20:09:26] so it's ok if that is the only thing that fails
[20:09:42] I can't read /var/log/upstart/parsoid-test.log or syslog
[20:09:59] so if there is nothing else there then we should be fine
[20:10:48] /var/log/parsoid/parsoid.log only shows the -c issue
[20:11:33] i don't see any errors
[20:11:42] oh, i see what you're saying
[20:13:02] we could quickly bump the deploy repo to current master
[20:13:05] let me do that
[20:17:58] k, say when
[20:18:53] waiting for the merge
[20:19:26] from me or someone on parsoid?
[20:20:13] btw, the pattern you take for other env vars is to define them in the job file and then allow them to be overridden in the defaults file
[20:20:14] from this Jenkins guy
[20:20:22] but you don't do it for PARSOID_LOG_FILE
[20:20:36] why not env PARSOID_LOG_FILE=/dev/null ?
[20:20:46] that way it's consistent with the other vars
[20:22:21] that stuff might get cleaned up while doing the debian packaging
[20:22:48] oh, come on. i'll just fix it quickly
[20:22:48] we are also rewriting the logging backend, with support for structured logging, multiple log files by severity / area etc
[20:23:12] so we'll likely have a split between access and error logs
[20:24:00] go for it- all I'm saying is that this var is on its way out
[20:24:25] Antoine added it for betalabs, and we kept it to keep the delta small for now
[20:26:19] new code is pushed out now
[20:26:29] (03PS1) 10Ori.livneh: parsoid upstart: make PARSOID_LOG_FILE consistent with other vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/110621
[20:26:35] can you +1?
[20:27:22] (03CR) 10GWicke: [C: 031] parsoid upstart: make PARSOID_LOG_FILE consistent with other vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/110621 (owner: 10Ori.livneh)
[20:27:32] (03CR) 10Ori.livneh: [C: 032 V: 032] parsoid upstart: make PARSOID_LOG_FILE consistent with other vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/110621 (owner: 10Ori.livneh)
[20:29:08] I guess a part of the logging config will move into the config file as it gets too complex for env vars
[20:29:47] yeah, that makes sense. i just OCD about things like this :P
[20:30:38] that's fine ;)
[20:30:56] can you try the upstart setup once more?
[20:31:26] there we go
[20:31:31] yeah, i was waiting for puppet to finish
[20:32:17] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time
[20:33:13] ok, so we're done, right? i'm going to kill the upstart job and relaunch the sysv one instead
[20:33:50] gwicke: ^
[20:33:54] seems to work
[20:34:04] yup, thanks!
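[Editor's note: the "default in the job file, override in the defaults file" pattern ori asks about might look roughly like this. Upstart's `env` stanza sets the default; sourcing an /etc/default file from the script block is one common way to let it be overridden, but the actual mechanism in the operations/puppet job file is an assumption here:

    # /etc/init/parsoid.conf (hypothetical sketch)
    env PARSOID_LOG_FILE=/var/log/parsoid/parsoid.log
    script
        # local overrides, e.g. PARSOID_LOG_FILE=/dev/null on betalabs
        [ -f /etc/default/parsoid ] && . /etc/default/parsoid
        exec node api/server.js >> "$PARSOID_LOG_FILE" 2>&1
    end script
]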
[20:34:05] yeah
[20:34:10] just did some test requests
[20:36:17] overall not as easy as hoped, but we got there in the end ;)
[20:37:39] gwicke: be careful with https://gerrit.wikimedia.org/r/#/c/110617; it doesn't purge parsoid-test, just creates a new parsoid in its stead
[20:37:50] so you'll end up with two upstart jobs
[20:38:18] i should commit a patch that ensure => absents that file and you can rebase your patch on top of that
[20:38:27] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 16.666666
[20:38:37] this will also avert confusion if one of the wtp* machines has to be rebooted between now and monday
[20:39:13] https://gerrit.wikimedia.org/r/#/c/110617 should not be merged before Monday
[20:39:36] good point re purging
[20:39:38] yeah, but when it is merged on monday you will end up with two job definitions
[20:39:48] I guess we can just add it to that patch
[20:40:53] I see a purge directive for directories
[20:41:06] the technique for individual files is ensure abent?
[20:41:09] *absent
[20:41:10] yep
[20:41:16] already incoming
[20:41:17] k
[20:41:21] (03PS1) 10Ori.livneh: parsoid: ensure => absent on /etc/init/parsoid-test [operations/puppet] - 10https://gerrit.wikimedia.org/r/110624
[20:42:11] ori, lets roll this into I2b3cc40ef
[20:42:39] why?
[20:42:56] you need two patches either way
[20:44:11] on Monday we'd like to switch to upstart, ideally atomically
[20:44:11] (03PS2) 10Ori.livneh: WIP: Phase 2 of switching to the new Parsoid deploy system [operations/puppet] - 10https://gerrit.wikimedia.org/r/110617 (owner: 10GWicke)
[20:44:28] that includes 1) activating upstart, and 2) purging the old upstart
[20:45:06] but then 1) you have to create a separate resource for the parsoid-test file
[20:45:20] 2) you have to submit a follow-up patch to remove it, since it will be obsolete after a single puppet run
[20:45:33] 3) if one of the wtp* machines is rebooted before that, upstart will compete with the init.d script
[20:45:58] i rebased your patch on top of mine, take a look and see if that makes sense to you. i think it's the right way to go
[20:46:02] i'd merge mine now and leave yours for monday
[20:46:07] ah, I see- you want to temporarily remove it now
[20:46:12] makes sense
[20:46:18] cool
[20:46:49] (03CR) 10Ori.livneh: [C: 032] parsoid: ensure => absent on /etc/init/parsoid-test [operations/puppet] - 10https://gerrit.wikimedia.org/r/110624 (owner: 10Ori.livneh)
[20:47:21] ori, thanks for all your help with this!
[20:47:40] np, happy to help
[20:47:41] amazing how complex something seemingly simple can be
[20:51:07] how are you going to deploy this on monday?
[20:51:47] (03CR) 10GWicke: [C: 04-1] "All ready for a Monday deploy now. Ori just tested the new upstart config, deploy repo etc. This should not be merged before then, so -1in" [operations/puppet] - 10https://gerrit.wikimedia.org/r/110617 (owner: 10GWicke)
[20:52:23] ori, by hopefully getting a root to merge this at the start of our deploy window
[20:52:24] gwicke: actually, i think that there is one other thing you should do
[20:53:00] well, it doesn't stop the init.d job
[20:53:15] that can be done with dsh
[20:53:39] it is a one-off, so I figured asking a root to run this once manually is ok
[20:53:41] but puppet runs on rand(last octet of ip) % 30 iirc
[20:54:01] if it runs between dsh and this change getting merged it'll get restarted
[20:54:33] plus, you don't remove the /etc/init.d/parsoid file
[20:54:47] that can happen in a later cleanup patch
[20:54:58] the service mechanism prefers upstart
[20:55:27] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[20:55:36] I guess we could add a command to upstart that does something like killall nodejs
[20:56:12] to take out the init.d instance in time
[20:56:43] yes. it's /sbin/service that prefers upstart; if you specify /etc/init.d/parsoid stop it always means the sysv job
[20:57:10] ori: Got a few minutes to deploy a Wikidata update?
[20:57:26] hoo: the last person who said yes had his computer mysteriously shut down
[20:57:29] on a Friday?
[20:57:34] for reboot afaik service is used, so cleaning up the init.d later should be fine
[20:57:35] fixes js issue and some missing messages
[20:57:42] Yep
[20:57:49] not lovely to wait until monday
[20:58:02] People are reporting this issue all day long...
[20:58:03] where's the patch?
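[Editor's note: the "ensure => absent" technique discussed above is Puppet's way of removing a single managed file, as opposed to the `purge` directive, which only applies to directory resources. Change 110624 presumably amounted to something like this sketch (not the actual patch; the exact path and surrounding class are assumptions):

    # remove one specific file that Puppet previously managed
    file { '/etc/init/parsoid-test.conf':
        ensure => absent,
    }

    # for comparison, the directory-level purge that was spotted first:
    # remove any files under a managed directory that Puppet doesn't know about
    file { '/etc/init/managed-dir':
        ensure  => directory,
        recurse => true,
        purge   => true,
    }
]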
[20:58:10] otherwise our deplyment went super smooth
[20:58:14] ori: https://gerrit.wikimedia.org/r/110536
[20:58:27] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 10.933333
[20:58:35] We tested it on beta (also l10n update)
[20:59:00] that's the submodule bump, but what patchsets does that encompass?
[20:59:38] ori: An autoloader-optimize (done by composer, so basically a no-op), two JS fixes, and it fixes Warnings on l10n update
[21:00:36] * aude looks for patches
[21:02:05] https://gerrit.wikimedia.org/r/110481 https://gerrit.wikimedia.org/r/110527
[21:02:07] js fixes
[21:03:01] the autoloader thing is ugh
[21:03:20] that should have been a separate patch, and you should have not cherry-picked it along with the bugfix to the production branch
[21:04:56] ori: Yep, those aren't needed... back then I thaught the l10n warnings might have been coming from that
[21:05:13] i thought it's "basically a no-op"!
[21:05:56] In fact it is, it's just telling composer to do everything in one place, instead of using PSR-4 loaders
[21:06:03] the autoload class map is not required
[21:06:10] it's an optimization and nice to have
[21:06:31] No point reverting the whole change
[21:06:38] run composer dump-autoload without -ß
[21:06:42] i was going to amend it to exclude only the non-js bits
[21:06:44] * -o
[21:06:51] amend the revert, i mean
[21:06:55] how does that sound?
[21:07:02] ori: Go ahead, will merge than
[21:07:15] find to try one at at time
[21:07:27] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:07:46] aude: You shouldn't have done that in the first place... https://gerrit.wikimedia.org/r/110449
[21:07:51] someohow data types i18n was missing, and think that's related to autoloading
[21:08:04] (03PS3) 10GWicke: WIP: Phase 2 of switching to the new Parsoid deploy system [operations/puppet] - 10https://gerrit.wikimedia.org/r/110617
[21:08:08] so we basically already had the optimized auto loaders deployed once
[21:08:20] hoo: yep, always use -o
[21:08:32] :P
[21:08:32] be consistent
[21:08:48] i rebased it. why did it need to be rebased...?
[21:09:22] (03CR) 10GWicke: [C: 04-1] "Call /etc/init.d/parsoid stop if present for transition, so that we don't have to deal with randomized puppet run timings on Monday." [operations/puppet] - 10https://gerrit.wikimedia.org/r/110617 (owner: 10GWicke)
[21:09:23] ori: Because we had that other JS fix (same file, though)
[21:09:38] aude: Well, that way we should probably stick with -o
[21:09:46] which we originally deployed
[21:10:27] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 29.200001
[21:10:38] ori: In a nutshell: The auto loader change only undid an accident before, so both versions of the autoloader have been deployed before
[21:10:44] That's why it's a no-op
[21:11:07] (03CR) 10Ori.livneh: "looks good" [operations/puppet] - 10https://gerrit.wikimedia.org/r/110617 (owner: 10GWicke)
[21:14:22] hoo: https://gerrit.wikimedia.org/r/#/c/110630/
[21:14:37] Approved that, let's wait for Jenkins
[21:15:33] aude: We should probably also do that on master to test on beta, I never tested the l10n update on the small-autoloader
[21:15:46] ure
[21:15:50] sure*
[21:15:50] that was never supposed to get deployed
[21:16:22] eitehr way the autoload works, just not as efffient and seems localisation update prefers class map
[21:16:27] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:16:36] jenkins never would have merged if autoload was a problem
[21:16:45] (+ tested it)
[21:16:59] aude: Probably
[21:17:20] hoo: you'll bump the submodule?
[21:17:26] ori: One second
[21:18:37] Ok, indeed wont break l10n updates, so will bump the submodule
[21:18:51] why did jenkins not submit?
[21:19:00] probably still testing
[21:19:04] we have lots of tests
[21:19:05] unit tests are slow
[21:19:07] yep
[21:19:23] looks submitted now
[21:19:23] no, it +2'd, just did not merge
[21:19:31] there we go
[21:19:34] it merged
[21:19:51] branch bumped
[21:20:34] aude: Daniel pointed out, that we should either fix build.sh or delete it
[21:20:39] cause right now it's broken :P
[21:20:48] Will pull in a lot of unrelated things
[21:21:00] hoo: sure. is -o required for composer update?
[21:21:16] i don't think it's broken
[21:21:24] aude: Not sure
[21:21:31] it will update data-values and whatever
[21:21:40] and we don't want that, unless stuffs really broken
[21:21:57] then we need to fix composer.json in wikibase branch
[21:22:06] make sure dependencies are fixed and not set to vary at all
[21:22:42] aude: composer.lock should do that... but it only does that if you run composer install, update overrides it
[21:22:53] - Removing data-values/geo (0.1.1)
[21:22:53] - Installing data-values/geo (0.1.2)
[21:22:57] o, in composer json
[21:22:59] no*
[21:23:07] it can be set to allow a range, say up to 0.2
[21:23:11] 0.1.*
[21:23:13] oh, you mean hard code the versions?
[21:23:18] or be set to 0.0.1
[21:23:20] We could do that
[21:23:22] whatever, yes
[21:25:37] hoo, aude: is there a bug number for the bug that this is fixing?
[21:25:43] aude: Just using composer install will probably also do just fine, that's easier... but whatever, we wont use that build thing for long anyway
[21:25:51] hoo?
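[Editor's note: the version-pinning idea aude describes — constraining dependencies in composer.json so that `composer update` cannot pull in unrelated upgrades — would look roughly like this. The package name comes from the log; the constraints are illustrative, not Wikibase's actual composer.json:

    {
        "require": {
            "data-values/geo": "0.1.2"
        }
    }

An exact version like "0.1.2" hard-codes the dependency; a range like "0.1.*" allows patch-level updates up to (but not including) 0.2. As noted above, `composer install` honors the pinned versions in composer.lock, while `composer update` rewrites the lock file from composer.json, and `composer dump-autoload -o` regenerates the optimized classmap that the autoloader discussion was about.]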
[21:26:06] ori: We had one, but that was opened after the fix, can fetch it
[21:26:19] please, so i can cite it in the sync message
[21:26:38] https://bugzilla.wikimedia.org/show_bug.cgi?id=60670
[21:26:50] compoer install does not update
[21:28:42] aude: I think it does, because the version is only hard coded to " "version": "dev-mw1.23-wmf11","
[21:29:04] !log ori synchronized php-1.23wmf12/extensions/Wikidata 'Iab3e51adf, I1364bfc86, I25dbed450: bug-fixes for 60670'
[21:29:14] Logged the message, Master
[21:29:15] ^ aude, hoo
[21:29:44] :) \o
[21:29:59] ori: Can you also kick of the l10n update?
[21:30:28] yes
[21:31:37] aude: You're right... what I did was just running "composer update wikibase/wikibase" so that it only updates Wikibase
[21:31:46] * aude off
[21:31:52] hoo can handle everything :)
[21:32:26] ;)
[21:34:06] Updating LocalisationCache for 1.23wmf12...
[21:34:28] Hey everyone - does anyone know what happened to dumps.wikimedia.org?
[21:34:38] I'm getting a 403 forbidden error when accessing it now
[21:35:14] I cannot replicate sayhar
[21:35:18] me neither
[21:35:33] !log ori started scap: scap 1.23wmf12 for wikidata i18n changes
[21:35:36] are you in the office?
[21:35:41] Logged the message, Master
[21:36:04] I am
[21:36:09] drdee ^
[21:36:25] oh, now it's back.
[21:36:30] !log demon synchronized php-1.23wmf11/extensions/LiquidThreads/pages/TalkpageView.php 'Fix for broken search bar'
[21:36:38] Logged the message, Master
[21:36:52] strange. Okay.
[21:37:04] !log demon synchronized php-1.23wmf12/extensions/LiquidThreads/pages/TalkpageView.php 'Fix for broken search bar'
[21:37:12] Logged the message, Master
[21:40:10] suuuure, deploly LQT on a Friday.... ;)
[21:41:05] LQT and Wikidata
[21:41:52] shhh
[21:42:24] greg-g: It's ok because it's already Saturday in some parts of the world. There's no rule about not deploying on Saturday.
[21:42:42] * greg-g kicks bd808 in the shins
[21:43:47] * bd808 has fallen down and can't get up
[21:44:03] now you make me feel bad
[21:44:15] here, have a LifeAlert(TM)
[21:44:58] i'm pretty sure scap just makes up apaches
[21:45:10] <^d> greg-g: Sorry about that.
[21:45:12] bd808: note that on that wiki page
[21:45:14] we don't really have so many servers
[21:45:21] <^d> It was a regression with a 1 line fix.
[21:45:39] <^d> Very visible regression for LQT users.
[21:45:46] "As a deployer, I want scap2.0 to deploy to imaginary apaches so that we'll be ready when we do have them."
[21:45:49] ori: I've been reading the code and I wouldn't doubt that it can randomly generate output
[21:46:08] ^d: s'ok, I mostly trust you, it was a joke about LQT mostly
[21:46:15] mostly mostly
[21:46:15] <^d> :)
[21:46:16] tampa already provides imaginary failover
[21:46:25] <^d> this ^
[21:46:30] <^d> when it works!
[21:46:31] :)
[21:46:35] <^d> I can't always scap to it!
[21:47:17] * ori jokes
[21:47:24] !log ori finished scap: scap 1.23wmf12 for wikidata i18n changes (duration: 14m 26s)
[21:47:33] Logged the message, Master
[21:47:42] * bd808 pushes ori away from the keyboard and towards the bar down the street
[21:47:54] yeah, i really should
[21:48:05] ok, going to wait a couple of minutes to ensure nothing is on fire
[21:48:37] ori: Verified the things on the site, everything looks fine... thanks for your help
[21:50:45] no problem, i'm off too. bye
[21:55:27] (03PS1) 10Cmjohnson: Removing dns entries for payments[1-4] [operations/dns] - 10https://gerrit.wikimedia.org/r/110638
[21:56:27] PROBLEM - Host payments3 is DOWN: PING CRITICAL - Packet loss = 100%
[21:56:57] PROBLEM - Host payments4 is DOWN: PING CRITICAL - Packet loss = 100%
[21:56:57] PROBLEM - Host payments1 is DOWN: PING CRITICAL - Packet loss = 100%
[21:57:07] PROBLEM - Host payments2 is DOWN: PING CRITICAL - Packet loss = 100%
[22:22:09] (03CR) 10Hashar: [C: 031] Don't allow less than 1 thread in mw-update-l10n [operations/puppet] - 10https://gerrit.wikimedia.org/r/110609 (owner: 10Hoo man)
[22:23:00] (03CR) 10Aaron Schulz: [C: 031] Don't allow less than 1 thread in mw-update-l10n [operations/puppet] - 10https://gerrit.wikimedia.org/r/110609 (owner: 10Hoo man)
[22:24:27] cmjohnson1: probably want to remove payments hosts from icinga :-D
[22:24:50] good idea ^ thx
[22:33:41] (03PS1) 10Cmjohnson: adding payments1-4 to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110643
[22:35:26] (03PS2) 10Cmjohnson: adding payments1-4 to decom.pp and fixing white space [operations/puppet] - 10https://gerrit.wikimedia.org/r/110643
[22:39:41] (03PS3) 10Cmjohnson: blasted white space!!! adding payments1-4 to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110643
[22:42:23] (03CR) 10Cmjohnson: [C: 032] blasted white space!!! adding payments1-4 to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/110643 (owner: 10Cmjohnson)
[22:44:13] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for payments[1-4] [operations/dns] - 10https://gerrit.wikimedia.org/r/110638 (owner: 10Cmjohnson)
[22:55:52] (03PS1) 10Ottomata: WIP - Kafkatee puppet module [operations/puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/110650
[22:56:31] (03CR) 10Ottomata: "WIP, currently depends on" [operations/puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/110650 (owner: 10Ottomata)
[22:58:25] ACKNOWLEDGEMENT - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% Chris Johnson possible main board issue
[22:59:24] wat
[23:45:11] (03Abandoned) 10Subramanya Sastry: WIP: Update parsoid puppet config to use new repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/106471 (owner: 10Subramanya Sastry)