
Looking at the very simple program below, how do I figure out where the CPU time is spent?

use mmap::*;
use crc::{crc32, Hasher32};
use std::cmp::min;
use std::env::args;
use std::fs::{metadata, File};
use std::os::unix::io::AsRawFd;
use std::slice::from_raw_parts;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let filename = args().nth(1).unwrap();
    let t0 = Instant::now();
    let file = File::open(&filename)?;
    let fd = file.as_raw_fd();
    let fs = metadata(&filename)?;
    let sz = fs.len() as usize;
    let mut offset: usize = 0;
    let mut c32 = crc32::Digest::new(crc32::IEEE);
    while offset < sz {
        // Map the next window of up to 1 MiB and feed it to the digest.
        let rem = min(sz - offset, 1024 * 1024);
        let map = MemoryMap::new(
            rem,
            &[
                MapOption::MapFd(fd),
                MapOption::MapReadable,
                MapOption::MapOffset(offset),
            ],
        )
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
        let buf = map.data();
        c32.write(unsafe { from_raw_parts(buf, rem) });
        offset += rem;
        // `map` is dropped here, which unmaps the window.
    }
    // as_secs_f64() so that {:.3} actually prints fractional seconds;
    // the precision is silently ignored for the integer from as_secs().
    println!("{:08x} in {:.3}", c32.sum32(), t0.elapsed().as_secs_f64());
    Ok(())
}

This is intended to memory-map the file provided on the command line and compute its CRC32. I am not looking for other implementations that might achieve the same thing, because my intention is to practice interacting with libc functions.

The program appears to operate correctly, but it consumes much more time and CPU than the equivalent Go and Java programs I wrote, even though I think it is compiled with optimisations:

Cargo.toml:

[profile.dev]
opt-level = 3

[package]
name = "mmap"
version = "0.1.0"
authors = ["Gee"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
mmap = "~0.1.1"
crc = "^1.0.0"

Mac OSX, SSD. I executed the Go and Rust programs in various orders to eliminate the impact of "cold" versus "warm" file buffers: Go takes about 4 seconds cold and under a second warm, while Rust always takes 30+ seconds. I tried cargo build --release, too. CPU utilisation by the Rust program is also substantially higher (e.g. 25% for Go vs. 80+% for Rust). I would expect the CPU time to come mostly from the CRC-32 computation and from copying the file contents between buffers, so the difference must be something that causes Rust to do extra work here.
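For completeness, one coarse way to see where the time goes on Mac OSX without touching the code is to time the release binary with /usr/bin/time and capture call stacks with the built-in sample tool while it runs. Both tools ship with the OS; the binary path below is deduced from the package name in the Cargo.toml above, and the sample command should be run from a second terminal while the program is busy:

$ cargo build --release
$ /usr/bin/time -l ./target/release/mmap ~/demo/dest.xml
$ sample mmap 10

sample prints a call-tree report of where the process spent its samples, which is usually enough to tell syscall overhead apart from checksum work.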

Go program scanning a 1GB XML file using the same method, syscall.Mmap, mapping 1MB at a time:

$ ../a ~/demo/dest.xml
a772d8c4 in 3.978

Cargo run:

$ cargo run ~/demo/dest.xml
    Finished dev [optimized + debuginfo] target(s) in 0.13s
     Running `target/debug/mmap /Users/ololo/demo/dest.xml`
a772d8c4 in 33

In the comments there was a request to show the Java program. Instead, here is the Go program, which can be seen to read that file in 3.9 seconds:

package main

import (
	"fmt"
	"hash/crc32"
	"os"
	"syscall"
	"time"
)

func main() {
	t0 := time.Now()
	fn := os.Args[1]
	fd, err := os.Open(fn)
	if err != nil {
		fmt.Printf("Couldn't open %s", err)
		return
	}
	defer fd.Close()
	fi, err := fd.Stat()
	if err != nil {
		fmt.Printf("Couldn't stat: %s", err)
		return
	}
	sz := fi.Size()
	cksum := crc32.NewIEEE()
	var off int64 = 0
	for sz > off {
		rem := 1024 * 1024
		if sz-off < int64(rem) {
			rem = int(sz - off)
		}
		data, err := syscall.Mmap(int(fd.Fd()), off, rem, syscall.PROT_READ, syscall.MAP_SHARED)
		if err != nil {
			fmt.Printf("Couldn't mmap at %d", off)
			return
		}
		off += int64(rem)
		n, err := cksum.Write(data)
		if err != nil || n != len(data) {
			fmt.Printf("Somehow could not cksum %d", len(data))
			return
		}
		syscall.Munmap(data)
	}
	fmt.Printf("%x in %.3f\n", cksum.Sum32(), time.Now().Sub(t0).Seconds())
}
  • 1. cargo run --release; 2. Your abuse of unsafe when you could've done without it robbed the compiler of a few valuable optimizations, even with --release; 3. You didn't actually show us the Java code, so we can't even tell you whether you benchmarked the same thing. Commented Oct 13, 2019 at 18:30
  • @SébastienRenauld 1. I tried that; 2. It's not abuse - it appears to be necessary to convert the pointer to a slice - but more importantly, how can I confirm that, or how can I get by without using unsafe?; 3. I thought of keeping the question short and focused on how to understand where Rust spends its time. I've updated the question with the Go program, which can be seen to run in 4 seconds and consume 25% CPU. Commented Oct 13, 2019 at 18:47
  • For #1, why did you report the unoptimized run time in the question? Not that I'm calling your results into question or anything, but it's a bit misleading. Commented Oct 13, 2019 at 18:53
  • If nobody has touched this question before I get home, I'll dig into it; I don't have the tools to get through this properly on this laptop. Profiling should turn up a few things, as the 25% CPU usage would be very noticeable. Commented Oct 13, 2019 at 19:04
  • If anyone is watching, here's the update. Adding time measurements of the function calls reveals there's something wrong with that mmap crate after all (a sketch of the instrumentation follows these comments):
    a772d8c4 in 37.119
    Time breakdown: mmap: 32.859 .data(): 0.000 from_raw_parts: 0.000 crc32.write: 4.233
    Commented Oct 14, 2019 at 8:47
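A minimal sketch of the kind of per-phase timing described in the last comment, using only std::time and the same crates as the question. The accumulator names (t_mmap, t_crc) and the breakdown format are illustrative, not the author's exact instrumented code:

use crc::{crc32, Hasher32};
use mmap::{MapOption, MemoryMap};
use std::cmp::min;
use std::env::args;
use std::fs::{metadata, File};
use std::os::unix::io::AsRawFd;
use std::slice::from_raw_parts;
use std::time::{Duration, Instant};

fn main() -> std::io::Result<()> {
    let filename = args().nth(1).unwrap();
    let file = File::open(&filename)?;
    let fd = file.as_raw_fd();
    let sz = metadata(&filename)?.len() as usize;
    let mut c32 = crc32::Digest::new(crc32::IEEE);

    // One accumulator per phase whose cost we want to attribute.
    let mut t_mmap = Duration::new(0, 0);
    let mut t_crc = Duration::new(0, 0);

    let mut offset = 0usize;
    while offset < sz {
        let rem = min(sz - offset, 1024 * 1024);

        let t = Instant::now();
        let map = MemoryMap::new(
            rem,
            &[
                MapOption::MapFd(fd),
                MapOption::MapReadable,
                MapOption::MapOffset(offset),
            ],
        )
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
        t_mmap += t.elapsed();

        let t = Instant::now();
        c32.write(unsafe { from_raw_parts(map.data(), rem) });
        t_crc += t.elapsed();

        offset += rem;
    }

    println!("{:08x}", c32.sum32());
    println!(
        "Time breakdown: mmap: {:.3} crc32.write: {:.3}",
        t_mmap.as_secs_f64(),
        t_crc.as_secs_f64()
    );
    Ok(())
}

A breakdown like the one quoted above, where mmap dominates, points at the mapping calls rather than at the checksum.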

1 Answer


Turns out there are OS-specific details here.

let map = MemoryMap::new(
    rem,
    &[
        MapOption::MapFd(fd),
        MapOption::MapNonStandardFlags(libc::MAP_SHARED),
        MapOption::MapReadable,
        MapOption::MapOffset(offset),
    ],
)
.map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;

The crate I am using does not expose MAP_SHARED explicitly for some reason, and judging by the comments on the question, the flag probably makes no difference on Linux. Mac OSX, however, appears to treat a mapping without MAP_SHARED differently, making the mmap syscall substantially slower. With MAP_SHARED specified (via MapNonStandardFlags, which needs the libc crate for the constant), the "warm" run on Mac is back to about 3 seconds.
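Since the stated goal is to practise interacting with libc functions, another option is to skip the crate entirely and call libc::mmap directly, passing MAP_SHARED yourself. A sketch, assuming libc = "0.2" under [dependencies]; the helper name with_mapped is made up for illustration, and offsets must remain page-aligned (which the 1 MiB stride above guarantees):

use std::os::unix::io::RawFd;
use std::slice::from_raw_parts;

// Hypothetical helper: maps `len` bytes of `fd` at `offset` with MAP_SHARED,
// hands the mapped bytes to `f`, then unmaps them.
fn with_mapped<R>(
    fd: RawFd,
    offset: usize,
    len: usize,
    f: impl FnOnce(&[u8]) -> R,
) -> std::io::Result<R> {
    unsafe {
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ,
            libc::MAP_SHARED,
            fd,
            offset as libc::off_t,
        );
        if ptr == libc::MAP_FAILED {
            return Err(std::io::Error::last_os_error());
        }
        let result = f(from_raw_parts(ptr as *const u8, len));
        libc::munmap(ptr, len);
        Ok(result)
    }
}

The loop body then reduces to with_mapped(fd, offset, rem, |buf| c32.write(buf))?; and the mmap crate dependency can be dropped entirely.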
