You have 37 minutes and 14 seconds of audio, so the video is going to be at least as large as the original audio file (36 MB), plus one image and a few bytes for each frame saying "the image has not changed". At 30 fps, that comes to about 67,000 frames.
So it will be larger than the input audio file. It could also be that the tool you're using is set to add a full frame every now and then, say once every second. These full frames are called key frames, and they allow a player to start the video midway without having to replay everything from the start. That could easily add a few MB of data to a 37-minute video. A 640x480 frame is about 1 MB uncompressed, probably about 0.01 MB once compressed (since it's just text). Your video lasts about 2,234 seconds, so that would mean 0.01 MB x 2,234 = 22 MB of key frames alone, and 36 MB + 22 MB = 58 MB.
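If you want to play with the numbers, here is the same back-of-the-envelope estimate as a few lines of Python. The 0.01 MB per key frame and the one-key-frame-per-second setting are only my guesses from above, not measured values, so treat the result as a rough order of magnitude.

```python
# Figures taken from the estimate above; the per-key-frame size is a rough guess.
duration_s = 37 * 60 + 14          # 2,234 seconds of audio
fps = 30
total_frames = duration_s * fps    # 67,020 frames

audio_mb = 36                      # size of the original audio file
keyframe_mb = 0.01                 # assumed compressed size of one key frame
keyframes = duration_s             # assumed: one key frame per second

keyframe_total_mb = keyframes * keyframe_mb
estimated_mb = audio_mb + keyframe_total_mb

print(f"{total_frames} frames, {keyframe_total_mb:.0f} MB of key frames, "
      f"about {estimated_mb:.0f} MB in total")
```

Changing `keyframes` to 1 shows how much of the estimate comes from the periodic key frames rather than from the audio.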
Some tools can remove all the key frames. I've used ffmpeg and generated mp4 files without key frames. I don't know whether the tool you're using is capable of that, though.
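For what it's worth, here is a rough sketch of the kind of ffmpeg call I have in mind, written as a small Python helper that only builds the argument list and does not run anything. The helper name `build_ffmpeg_cmd` and the file names are made up for illustration, and I'm assuming the libx264 encoder and an audio format that can be copied into an mp4 container. The first frame always has to be a key frame, so "no key frames" here really means a key frame interval (`-g`) as long as the whole video.

```python
import shlex

def build_ffmpeg_cmd(image, audio, output, fps=30, duration_s=2234):
    """Build (but do not run) an ffmpeg command for a still image plus audio."""
    total_frames = fps * duration_s
    return [
        "ffmpeg",
        "-loop", "1",              # repeat the single input image
        "-framerate", str(fps),    # input frame rate for the image
        "-i", image,
        "-i", audio,
        "-c:v", "libx264",         # assumption: H.264 via libx264
        "-tune", "stillimage",
        "-g", str(total_frames),   # key frame interval as long as the whole video
        "-sc_threshold", "0",      # no extra key frames from scene-cut detection
        "-c:a", "copy",            # keep the audio stream as it is
        "-shortest",               # stop when the audio ends
        output,
    ]

# Hypothetical file names, only to show the resulting command line.
cmd = build_ffmpeg_cmd("cover.png", "talk.mp3", "talk.mp4")
print(shlex.join(cmd))
```

You would need to adapt the codec and audio options to your own files before running the printed command.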