@waldbauer-certat waldbauer-certat commented Mar 17, 2021

NOTE: This is a proof of concept and is still being heavily tested!

Introduction

Msgpack (MessagePack) is a (de)serialization format that is similar to JSON, but more optimized for machine-to-machine (M2M) communication. There are certainly better protocols such as protobuf, FlatBuffers, Cap'n Proto, SBE and so on, but they don't fit IntelMQ very well. Msgpack uses a key-value pattern (like JSON), so there won't be any major change. The real "magic" lies in how the data is stored: JSON is very human-readable because it is serialized as text, whereas msgpack packs the data into a binary format, which results in a smaller size and faster processing; see the benchmark below.
If you want to know more about the spec, check it out here.
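To make the size difference concrete, here is a minimal sketch of how MessagePack encodes a small map compared to JSON. The `pack` function is a hypothetical, hand-rolled encoder written only for illustration. It covers a tiny subset of the format (nil, booleans, unsigned integers below 2**32, short strings and small maps) and is not the `msgpack` library or the code used in this PR.

```python
import json
import struct


def pack(obj):
    """Encode a tiny subset of MessagePack (illustration only)."""
    if obj is None:
        return b"\xc0"                                  # nil
    if obj is True:
        return b"\xc3"                                  # true
    if obj is False:
        return b"\xc2"                                  # false
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])                         # positive fixint
        if 0 <= obj <= 0xFF:
            return b"\xcc" + bytes([obj])               # uint 8
        if 0 <= obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)     # uint 16
        if 0 <= obj <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", obj)     # uint 32
        raise ValueError("integer out of range for this sketch")
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw       # fixstr
        if len(raw) <= 0xFF:
            return b"\xd9" + bytes([len(raw)]) + raw    # str 8
        raise ValueError("string too long for this sketch")
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("map too large for this sketch")
        out = bytes([0x80 | len(obj)])                  # fixmap
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise TypeError(f"unsupported type: {type(obj).__name__}")


example = {"compact": True, "schema": 0}
as_json = json.dumps(example, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(example)

print(len(as_json), len(as_msgpack))  # 27 18
```

Keys and values are still stored as pairs, just as in JSON. The savings come from replacing quotes, colons, commas and textual literals with one-byte type and length prefixes.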

Msgpack itself is available for multiple languages, such as Go, Python, JavaScript, PHP and so on.

In addition, Redis, our internal message queue, is also capable of using msgpack within its Lua API.

What's the goal?

  • Faster processing time for (de)serialization
  • Smaller memory footprint
  • No breaking changes

Benchmark

For the benchmark, data was extracted from spamhaus-drop-collector, parsed by spamhaus-drop-parser and measured in deduplicator-expert. 460 events were processed in total.

I've tested the bots above and they worked fine with this change. It might break other bots, which I haven't tested yet.
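A measurement of this kind could be sketched roughly as follows. This is not the harness used for the numbers in this PR: the `median_ns` helper, the synthetic events and their field values are assumptions made for illustration, and only the standard-library `json` module is timed so that the snippet runs without extra dependencies.

```python
import json
import statistics
import time


def median_ns(func, inputs):
    """Call func once per input and return the median wall-clock time in ns."""
    timings = []
    for item in inputs:
        start = time.perf_counter_ns()
        func(item)
        timings.append(time.perf_counter_ns() - start)
    return statistics.median(timings)


# Hypothetical stand-in for the 460 parsed events (not real feed data).
events = [
    {"source.network": f"192.0.2.{i % 256}/32", "feed.name": "example-feed"}
    for i in range(460)
]
encoded = [json.dumps(event) for event in events]

median_size = statistics.median(len(s.encode("utf-8")) for s in encoded)
serialize_ns = median_ns(json.dumps, events)
deserialize_ns = median_ns(json.loads, encoded)

print(f"median size: {median_size} bytes")
print(f"serialize: {serialize_ns} ns, deserialize: {deserialize_ns} ns")
```

With the `msgpack` package installed, `msgpack.packb` and `msgpack.unpackb` can be passed to `median_ns` in the same way to obtain the comparison values. Absolute timings depend heavily on the machine and the event contents.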

| Type    | Median data size  |
|---------|-------------------|
| JSON    | 387 bytes         |
| MSGPACK | 329 bytes         |
| Diff    | 58 bytes (16.20%) |

Serialize

| Type    | Median execution time in ns |
|---------|-----------------------------|
| JSON    | 39286                       |
| MSGPACK | 23483                       |
| Diff    | 15803 (50.35%)              |

Deserialize

| Type    | Median execution time in ns |
|---------|-----------------------------|
| JSON    | 30713                       |
| MSGPACK | 12602                       |
| Diff    | 18111 (83.62%)              |
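The PR does not state how the percentages are computed. The reported values match a difference taken relative to the mean of the two measurements, so the sketch below assumes that formula. The function name `symmetric_diff_percent` is hypothetical, and the JSON deserialization time of 30713 ns is the figure given in the commit message of this PR.

```python
def symmetric_diff_percent(a, b):
    """Difference between a and b as a percentage of their mean (assumed formula)."""
    return abs(a - b) / ((a + b) / 2) * 100


size = symmetric_diff_percent(387, 329)            # median size in bytes
serialize = symmetric_diff_percent(39286, 23483)   # median serialize time in ns
deserialize = symmetric_diff_percent(30713, 12602) # median deserialize time in ns

print(f"{size:.2f}% {serialize:.2f}% {deserialize:.2f}%")  # 16.20% 50.35% 83.62%
```

Measured against the JSON value alone, the reductions would come out smaller, for example 58 / 387 for the size, which is about 15%.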

To sum up, switching from JSON to msgpack results in faster (de)serialization and a lower memory footprint.

@waldbauer-certat waldbauer-certat force-pushed the waldbauer/msgpack-poc branch 13 times, most recently from 23cd283 to 9bce822 Compare March 18, 2021 12:10
@waldbauer-certat waldbauer-certat force-pushed the waldbauer/msgpack-poc branch 4 times, most recently from 5c6bdd5 to 9ab334e Compare April 1, 2021 09:32
@waldbauer-certat waldbauer-certat force-pushed the waldbauer/msgpack-poc branch 2 times, most recently from 6d9e656 to 40e4ae1 Compare June 30, 2021 15:41
@ghost ghost added the needs: feedback label Aug 20, 2021
waldbauer-certat and others added 29 commits July 15, 2022 13:21
Signed-off-by: Sebastian Waldbauer <waldbauer@cert.at>
This commit adds license information to a lot of files and adds a .reuse/dep5 file that lists the license information for some folders. The commit also changes the main license in setup.cfg from AGPL-3.0-only to AGPL-3.0-or-later, because only one file has AGPL-3.0-only as its license while multiple files have AGPL-3.0-or-later in the license header. It also removes the cef_logo.png file, as no information about its license could be found anywhere. It is now included directly from the website of the European Union. Closes #1633
and add legacy tag to shadowserver caida config
and add legacy tag to the configs it replaces and update changelog and documentation accordingly
fix mapping use compromised type if the data indicates an active webshell plus add testcases add changelog update bots documentation
enhance mappings add 4/6 agnostic mapping for `Sinkhole-Events` as well document feeds with IPv4 and IPv6 better and shorter
This commit adds a license header or a license file to most of the files, or documents the license in the .reuse/dep5 license file. Some of the process was automated, first by listing all the files that are not reuse lint compliant:
> reuse lint > ../reuse.lst
This list was then modified to remove metainformation and only list filenames. Also a couple of filenames that need manual modification were removed. Then using git and reuse:
> for file in `cat ../reuse.lst`; do year=`git log --reverse --pretty="format:%ai" $file | head -1 | cut -d "-" -f 1`; author=`git log --reverse --pretty="format:%an" $file|head -1`; reuse addheader --copyright="$author" --year="$year" --license="AGPL-3.0-or-later" --skip-unrecognised $file; done
Then the same process was repeated for files reuse does not recognize, like csv and json files or REQUIREMENTS.txt files.
match with RSIT in the taxonomy intrusions: compromised -> system-compromise unauthorized-command -> system-compromise unauthorized-login -> system-compromise adapt bots depending on the name add changelog and news entries, including SQL update statements
merged into information-content-security > unauthorised-information-modification adapt bots depending on the name add changelog and news entries, including SQL update statements
was renamed and marked as deprecated in 2.0.0.beta1 #1404
Compatibility with the deprecated configuration format (before 1.0.0.dev7) was removed. #1404
The deprecated shell scripts - `update-asn-data` - `update-geoip-data` - `update-tor-nodes` - `update-rfiprisk-data` have been removed in favor of the built-in update-mechanisms (see the bots' documentation). A crontab file for calling all new update command can be found in `contrib/cron-jobs/intelmq-update-database`. #1404
add two n6 images directly to the repository, as they are not displayed on readthedocs otherwise: the other websites hosting the images block loading them if the referer does not match a whitelist, and we can't add a noreferrer HTML attribute in rst either. The option left is to add the files, which only implies adding the licensing information and the AGPL-3.0 license text as well. add two illustrations of the flow from n6 to intelmq and vice versa, own work. some textual improvements in the document itself.
The Aggregate Expert might be used to aggregate events within a given timespan and threshold. Signed-off-by: Sebastian Waldbauer <waldbauer@cert.at>
Using msgpack instead of json results in faster (de)serialization and less memory usage. Redis is also capable of msgpack within its lua api, e.g. https://github.com/kengonakajima/lua-msgpack-native.

====== Benchmark =======
JSON median size: 387
MSGPACK median size: 329
------------------------
Diff: 16.20%

JSON
* Serialize: 39286
* Deserialize: 30713
MSGPACK
* Serialize: 23483
* Deserialize: 12602
---------------------
DIFF
* Serialize: 50.35%
* Deserialize: 83.62%

Data extracted from spamhaus-collector
Measurements based on deduplicator-expert
460 events in total processed by deduplicator-expert

Signed-off-by: Sebastian Waldbauer <waldbauer@cert.at>
Signed-off-by: Sebastian Waldbauer <waldbauer@cert.at>