ST-1322: Trogdor Produce Runner #569
Conversation
| "class": "org.apache.kafka.trogdor.workload.ProduceBenchSpec", | ||
| "bootstrapServers": "localhost:29092", | ||
| "targetMessagesPerSec": 10000, | ||
| "maxMessages": 50000, |
Is this the total number of messages produced?
If so, it seems a bit low to only produce for 5s; we'll want at least 30s of producing to amortize startup costs.
Yeah, this is just an example spec file demonstrating how to configure the workload. For real production perf tests, this is too short.
How about this?
Let's keep it as the 5s case. I am working on creating a regression Jenkins job that compares performance between Python clients and Java clients. We will have complex spec files for that Jenkins job. After I finish that, I will add those spec files to this tests/trogdor/ directory.
Sounds good
I don't think there is much value in a perf test that runs this short; the results will be too noisy. So let's provide a file with proper defaults: at least 30s, preferably 60s of runtime.
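For context, the expected runtime follows directly from the spec: runtime = maxMessages / targetMessagesPerSec, so the example's 50000 / 10000 gives only 5s, and a 60s run at the same rate needs maxMessages = 600000. A tiny sketch of that arithmetic (field names taken from the spec excerpt above):

    # Expected runtime of a produce workload, in seconds.
    def expected_runtime_s(workload):
        return workload["maxMessages"] / workload["targetMessagesPerSec"]

    print(expected_runtime_s({"maxMessages": 50000, "targetMessagesPerSec": 10000}))  # 5.0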
| "targetMessagesPerSec": 10000, | ||
| "maxMessages": 50000, | ||
| "activeTopics": { | ||
| "foo[1-3]": { |
can we use a more descriptive name, "py_trogdor" or something?
This is a demo spec file. The idea is to use one similar to the example for the Java Trogdor produce runner: https://github.com/apache/kafka/blob/trunk/tests/spec/simple_produce_bench.json
Changed the names to py_trogdor.
tests/trogdor/produce_spec_runner.py Outdated
| "p99LatencyMs": self.latency_histogram.get_value_at_percentile(99)/100.0, | ||
| "maxLatencyMs": self.latency_histogram.get_max_value()/100.0 | ||
| } | ||
| if realQPS: |
if realQPS is not None:
Fixed
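The distinction matters because a legitimate measurement of 0 is falsy in Python, so a truthiness check would silently drop it. A minimal illustration (realQPS as in the snippet above):

    realQPS = 0.0  # a valid measurement, e.g. nothing delivered in this window

    if realQPS:              # falsy: the zero measurement is silently skipped
        print("recording", realQPS)

    if realQPS is not None:  # only skips when the value is truly absent
        print("recording", realQPS)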
| trogdor_log("ProduceSpecRunner: delivery failed: {} [{}]: {}".format(msg.topic(), msg.partition(), err)) | ||
| self.nr_failed_messages += 1 | ||
| now = time.time() | ||
| latency = now - sent_time |
Should preferably skip the first few messages, or use some priming messages without a delivery report, to avoid the initial startup cost (which includes bringing up connections).
I am working on adding an iterations field to the produce spec. With iterations, you can control how many times Trogdor should run the same workload. That's a better way to compare warmup vs. stable cases.
That is probably true for the cluster itself, but each client instance will need some warmup to amortize connection setup times, etc., and this should not be included in the final throughput or latency results.
The warmup will be the first iteration, say the first 5000 messages. Also, the warmup latency is important; I suggest we show that too.
Are the iterations using the same producer instance?
If so, then this is fine, but if new producer instances are used for each iteration we still need a warmup period per iteration to avoid including startup costs in the throughput measurements.
Any comment on this?
I added the iterations field to the spec file. When the field is > 1, trogdor_runner.py reports a JSON array of performance results, e.g.:

    {"status": [
      {"totalSent": 50000, "totalRecorded": 50000, "totalError": 0, "planMPS": 10000,
       "averageLatencyMs": 13814.083504000002, "p50LatencyMs": 13864.95, "p95LatencyMs": 29470.71,
       "p99LatencyMs": 30719.99, "maxLatencyMs": 31006.71},
      {"totalSent": 50000, "totalRecorded": 50000, "totalError": 0, "planMPS": 10000,
       "averageLatencyMs": 19482.478866999998, "p50LatencyMs": 19394.55, "p95LatencyMs": 35573.75,
       "p99LatencyMs": 36597.75, "maxLatencyMs": 36823.03}
    ]}
Are the iterations reusing the same producer instance?
If not, then there is no warm-up benefit on the client side, so I really think they should be reused.
Yeah, they use the same producer.
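Since the producer is reused across iterations, only the first iteration pays the connection-setup cost. One way to keep that cost out of the steady-state numbers, sketched here with assumed names rather than the PR's actual code, is to discard the first N delivery reports from the latency statistics:

    import time

    WARMUP_MESSAGES = 5000  # assumed warmup size, per the discussion above

    class LatencyStats:
        """Collects per-message delivery latencies, ignoring warmup messages."""

        def __init__(self):
            self.seen = 0
            self.latencies = []

        def on_delivery(self, sent_time):
            self.seen += 1
            if self.seen <= WARMUP_MESSAGES:
                return  # still warming up: skip connection-setup noise
            self.latencies.append(time.time() - sent_time)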
tests/trogdor/produce_spec_runner.py Outdated
| self.nr_finished_messages += 1
|
| def get_msg_callback(self):
| product_time = time.time()
produce_time, or send_time to be consistent with message_on_delivery
sure.
Fixed
tests/trogdor/trogdor_utils.py Outdated
| pre = match.group(1)
| start = int(match.group(2))
| end = int(match.group(3))
| last = match.group(4)
suffix; last could be confused with end.
Fixed
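For reference, a minimal sketch of the kind of range expansion these capture groups implement, turning "foo[1-3]" into foo1, foo2, foo3 (the regex and function name here are assumptions, not the PR's actual code):

    import re

    # Matches names like "foo[1-3]" or "foo[1-3]_bar": prefix, start, end, suffix.
    TOPIC_RANGE_RE = re.compile(r"^(.*)\[(\d+)-(\d+)\](.*)$")

    def expand_topic_range(name):
        match = TOPIC_RANGE_RE.match(name)
        if match is None:
            return [name]  # plain topic name, no range to expand
        prefix, suffix = match.group(1), match.group(4)
        start, end = int(match.group(2)), int(match.group(3))
        return ["{}{}{}".format(prefix, i, suffix) for i in range(start, end + 1)]

    print(expand_topic_range("foo[1-3]"))  # ['foo1', 'foo2', 'foo3']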
tests/trogdor/trogdor_utils.py Outdated
| admin_request_timeout_ms = 25000
| create_kafka_conf(bootstrap_servers, common_client_config, admin_client_config)
| admin_conf = create_kafka_conf(bootstrap_servers, common_client_config, admin_client_config)
| admin_conf["socket.timeout.ms"] = admin_request_timeout_ms
Why is this needed? The defaults should be good.
Trying to be the same as the Java Trogdor produce worker.
Okay, let's avoid setting specific configs unless we know they are needed.
Please remove the unneeded configuration.
@edenhill Done.
tests/trogdor/trogdor_utils.py Outdated
|
| def create_producer_conn(bootstrap_servers, common_client_config, producer_config):
| producer_conf = create_kafka_conf(bootstrap_servers, common_client_config, producer_config)
| return Producer(**producer_conf)
No need for the double **; the constructor takes a dict as-is.
fixed
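For anyone following along: confluent_kafka.Producer accepts the configuration dict directly, so no unpacking is needed. A minimal sketch:

    from confluent_kafka import Producer

    producer_conf = {"bootstrap.servers": "localhost:29092"}

    # The constructor takes the config dict as-is; Producer(**producer_conf) is unnecessary.
    producer = Producer(producer_conf)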
tests/trogdor/trogdor_utils.py Outdated
| return Producer(**producer_conf)
|
|
| def create_admin_conn(bootstrap_servers, common_client_config, admin_client_config):
change conn to client.
fixed
tests/trogdor/trogdor_utils.py Outdated
| create_topic(admin_conn, topic_name, topic)
|
|
| def create_topic(admin_conn, topic_name, topic):
admin_client
Fixed
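For reference, a minimal sketch of topic creation with confluent_kafka's AdminClient, using the suggested admin_client naming (the topic name and settings are illustrative):

    from confluent_kafka.admin import AdminClient, NewTopic

    admin_client = AdminClient({"bootstrap.servers": "localhost:29092"})

    # create_topics() is asynchronous and returns a dict of topic name -> future.
    futures = admin_client.create_topics(
        [NewTopic("py_trogdor_1", num_partitions=3, replication_factor=1)])
    for topic, future in futures.items():
        future.result()  # raises KafkaException if creation failed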
@edenhill Thanks for the review. I addressed most of your comments.
edenhill left a comment
Some minor nits, otherwise LGTM!
The per-instance warmup is still a thing though.
tests/trogdor/produce_spec_runner.py Outdated
| self.nr_failed_messages = 0
| self.producer = create_producer_conn(self.bootstrap_servers, self.common_client_conf, self.producer_conf)
| trogdor_log("Produce {} at message-per-sec {}".format(self.max_messages, self.qps))
| trogdor_log("Produce {} at message-per-sec {}".format(self.max_messages, self.mps))
This could be clearer, "Produce {} messages at .."
Sure.
| "class": "org.apache.kafka.trogdor.workload.ProduceBenchSpec", | ||
| "bootstrapServers": "localhost:29092", | ||
| "targetMessagesPerSec": 10000, | ||
| "maxMessages": 50000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
* Execute the produce spec in multiple iterations with the same producer connection.
* When iterations is larger than 1, the trogdor_runner reports a JSON array of performance results showing the performance of each iteration.
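A minimal sketch of the iteration loop this describes, with assumed names rather than the PR's actual code; the key point is that the producer is created once, outside the loop:

    from confluent_kafka import Producer

    def run_iterations(workload, iterations):
        # One producer for all iterations, so only the first pays connection setup.
        producer = Producer({"bootstrap.servers": workload["bootstrapServers"]})
        results = []
        for _ in range(iterations):
            # run_one_iteration is a hypothetical helper returning one result dict.
            results.append(run_one_iteration(producer, workload))
        return {"status": results}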
stanislavkozlovski left a comment
Some small comments. I had actually started reviewing this when you posted the PR but never got around to finishing the review. Sorry!
| "class": "org.apache.kafka.trogdor.workload.ProduceBenchSpec", | ||
| "bootstrapServers": "localhost:29092", | ||
| "targetMessagesPerSec": 10000, | ||
| "maxMessages": 50000, |
Maybe we should increase this such that it doesn't run for 5 seconds only. Maybe 60s? (600000)
| "commandNode": "node0", | ||
| "workload": { | ||
| "class": "org.apache.kafka.trogdor.workload.ProduceBenchSpec", | ||
| "bootstrapServers": "localhost:29092", |
I guess this is a typo; the default broker port is 9092, right?
That is probably the docker mapped port.
@yangxi ?
@stanislavkozlovski It is the exposed Docker port, declared here: https://github.com/confluentinc/confluent-kafka-python/blob/master/tests/docker/docker-compose.yaml#L14
| @@ -0,0 +1,222 @@
| # Copyright 2016 Confluent Inc.
nit: 2019 :P
nit: 2020 :P
I am still working on this PR, will fix it :)
edenhill left a comment
Previous feedback is not fully addressed.
There are still open questions regarding iterations and whether or not they reuse the same producer instance (they should).
And it would be good to have some docstrings on the trogdor classes to explain what they are doing.
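For example, a docstring along these lines on the runner class (the class name is taken from the PR's file name; the wording is an assumption):

    class ProduceSpecRunner:
        """Run a Trogdor produce benchmark described by a ProduceBenchSpec.

        Produces maxMessages messages at a target rate of targetMessagesPerSec,
        records per-message delivery latency in a histogram, and reports
        throughput and latency percentiles for each iteration.
        """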
| run_trogdor() {
| start_cluster
| echo "Executing Trogdor"
| python ${TEST_SOURCE}/trogdor/trogdor_runner.py --spec ${TEST_SOURCE}/trogdor/example-produce-spec.json
This should not point to the example, but to a proper file; the user/dev should copy and modify the example file into place. Alternatively, we could call this file produce-spec.json, without the example part, to allow it to run out of the box.
| "commandNode": "node0", | ||
| "workload": { | ||
| "class": "org.apache.kafka.trogdor.workload.ProduceBenchSpec", | ||
| "bootstrapServers": "localhost:29092", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is probably the docker mapped port.
@yangxi ?
| "class": "org.apache.kafka.trogdor.workload.ProduceBenchSpec", | ||
| "bootstrapServers": "localhost:29092", | ||
| "targetMessagesPerSec": 10000, | ||
| "maxMessages": 50000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is much value in a perf test that runs for this short, the results will be too noisy. So let's provide a file with proper defaults. At least 30s, preferably 60s of runtime.
I'm going to progress the work in this PR over here: https://github.com/mhowlett/confluent-kafka-python/tree/trogdor
I'll add to the comments in this PR where relevant, but make updates to my own branch. We can sort out anything already raised in comments here, and continue the discussion in my branch when done.
Xi Yang seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
The Trogdor Runner executes Trogdor ExternalCommandSpec tasks. It can be invoked directly from the command line
(./tests/trogdor_runner --spec ExternalCommandSpecFile) or driven by Trogdor agents. When run from the command line, the Trogdor Runner reads the spec from the given file. When driven by Trogdor agents, the runner waits on stdin for ExternalCommandSpec tasks and then executes them.
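A minimal sketch of that dual-mode entry point, with assumed structure (only the --spec flag comes from the text above; run_task and the one-task-per-line stdin framing are illustrative assumptions):

    import argparse
    import json
    import sys

    def main():
        parser = argparse.ArgumentParser(description="Trogdor external command runner")
        parser.add_argument("--spec", help="path to an ExternalCommandSpec JSON file")
        args = parser.parse_args()

        if args.spec:
            # Command-line mode: read the whole spec from the given file.
            with open(args.spec) as f:
                run_task(json.load(f))  # run_task: hypothetical task executor
        else:
            # Agent mode: wait on stdin for tasks (assumed one JSON object per line).
            for line in sys.stdin:
                if line.strip():
                    run_task(json.loads(line))

    if __name__ == "__main__":
        main()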