Roman Numerals Look and Say in rust (BIO 2020 Q1)

Question

I was practicing question 1a of the British Informatics Olympiad 2020 past paper.

The Roman look-and-say description of a string (of Is, Vs, Xs, Ls, Cs, Ds and Ms) is made by taking each block of adjacent identical letters and replacing it with the number of occurrences of that letter, given in Roman numerals, followed by the letter itself. A block of adjacent identical letters is never broken into smaller pieces before describing it. For example:

MMXX is described as “two Ms followed by two Xs”. Since two is II in Roman numerals, this is written as IIMIIX;

IIMIIX is described as IIIIMIIIIX, which is “two Is, one M, two Is, one X”;

IIIIMIIIIX is described as IVIIMIVIIX;

It is not valid to describe III as, “two Is, one I” IIIII.

Note that Roman look-and-say descriptions are not necessarily Roman numerals.

Write a program that reads in a Roman numeral representing a number between 1 and 3999 inclusive, followed by n (1 ≤ n ≤ 50). You should apply the Roman look-and-say description n times and then output the number of Is in the final description, followed by the number of Vs.

I wrote this code (I broke it down into files as I want to reuse some of the functions for questions b&c):

q1.rs

static NUMERALS: [(u32, &str); 13] = [ (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I") ]; pub fn look_and_say(old: &String) -> String { let mut new = String::new(); let mut current: char = old.chars().nth(0).unwrap(); let mut length: u32 = 0; for c in old.chars() { if c != current { new.push_str(&to_roman(length)); new.push(current); length = 0; current = c; } length += 1; } new.push_str(&to_roman(length)); new.push(current); new } pub fn to_roman(mut n: u32) -> String { let mut numeral = String::new(); for (key, value) in NUMERALS { while key <= n { n -= key; numeral.push_str(value); } } numeral }

q1a.rs

mod q1; use std::io; fn main() { let mut buf = String::new(); io::stdin().read_line(&mut buf); let mut numeral = String::from(buf.split_whitespace().nth(0).unwrap()); let n = buf.split_whitespace().nth(1).unwrap().parse::<u32>().unwrap(); for i in 0..n { numeral = q1::look_and_say(&numeral); } println!("{} {}", numeral.chars().filter(|c| *c == 'I').count(), numeral.chars().filter(|c| *c == 'V').count() ); }

Example runs:

MMXX 1 4 0

MMXX 3 6 2

(More test cases can be found in the mark scheme. When I tested it, the program gave the right answer for all of the test cases.)

I would be grateful to hear about any improvements that could be made to this code. Please also tell me if there is a more idiomatic way of doing something, as I am quite new to rust.

N.B. I realized halfway through writing this post that I have already posted a solution for this question in python here, but this time I am practicing it in rust (because python is slow).

Why do you need to know the value of the roman numerals? (why did you make a NUMERALS (u32, &str) array?) — user98809
– user98809, Commented Oct 6, 2024 at 14:34
@minseong unless I'm missing something, you need it to display how many of the same character there is. (E.g. to convert "XX" to "IIX" you need to know what "2" is in roman numerals.) — sbottingota
– sbottingota, Commented Oct 6, 2024 at 16:30

Matthieu M. · Accepted Answer · 2024-10-06 11:50:39Z

There's a lot of ground to cover, so buckle your seatbelt, and hop on!

Craftsmanship

No test

It's great to know that your code passes the tests in the mark scheme, but that's no substitute for the lack of unit tests of the various methods you have.

No documentation

Rust provides full support for great doc comments, and you don't use them at all, which is a shame.

A small blurb describing what the function does, its pre-conditions, post-conditions, edge-cases, etc... is always useful.

Also, in the context of Olympiads, a note on the algorithmic complexity, time-wise and space-wise, would not be amiss.

No Linting

The Rust toolchain comes with a built-in linter named Clippy: use it!

In particular, it should flag the non-idiomatic &String argument. An &str argument is just as powerful, and allows passing anything which can hold a str value, not just a String.

Non-Idiomatic item order (mild)

It's not very idiomatic in Rust code to put a private item (NUMERALS) at the top of the file, when it's nothing more than an implementation detail. In general, I would recommend organizing items from high-level to low-level, with high-level items at the top, so that readers are naturally guided towards them.

Strange use of standard input

It's a bit strange that the arguments are read from standard input, when the use is not interactive, and only 2 arguments are used.

One would expect to read them from the command line instead, using std::env::args(), which returns an Iterator<Item = String>.

Entangled

The algorithm clearly works in two steps:

Identify the sub-sequences of common characters.
"Say" them.

Your code, however, performs both steps in a single function -- with only a small helper to convert numbers to roman numerals -- and therefore makes it impossible to independently test them.

I would recommend extracting step 1 from look_and_say, and creating an iterator which returns (usize, char) when iterating over a string.

Nesting

Nesting is the enemy of the software developer: unlike a compiler our minds don't have an infinite stack of facts that we can hold onto.

As a result, it's better to keep nesting as short as possible, which means using a guard style instead:

for c in old.chars() { if c == current { length += 1; continue; } new.push_str(&to_roman(length)); new.push(current); length = 0; current = c; }

Algorithmic

Repeated work in command line splitting

Rather than using nth(0) and nth(1) to parse the command line, it would be more idiomatic to use next():

let mut iter = buf.split_whitespace(); let mut numeral = iter.next().unwrap().to_vec(); let n: u32 = iter.next().unwrap().parse().unwrap();

Pessimistic romanization

I wonder if iteration over NUMERALS in to_roman is not... pessimistic.

I would expect that most numbers pass to it will be small, because the very nature of look-and-say means that even if you have, once, a thousand I, then it'll be described as MI, and next time you'll have 1 M and 1 I.

Thus, I would expect it would be interesting to bias the implementation of to_roman for small numbers, and thus NOT to iterate from 1000, 900, etc... but instead to have a first pass, starting from the other end, to figure the biggest number smaller than n.

let bigger = NUMERALS.iter().rposition(|(k, _)| *k > n).unwrap_or(0); let start = NUMERALS.len() - bigger; for (key, value) in &NUMERALS[start..] { ... }

Performance

To String or not to String

The problem description naturally leads to thinking in terms of String, but String provides more than you need, here, being a UTF-8 encoded Unicode string type.

All the numerals that you read, or write, are pure ASCII. By using a UTF-8 oriented abstraction, you're giving up quite a bit of performance for manipulating strictly ASCII.

Instead, you should switch to the byte-representation underneath, where each ASCII character is a single byte, and manipulate (and generate) [u8].

One Byte by One Byte

Your look-up routine checks bytes one by one. That's about the least efficient way to go about it, though by Unicode it may be difficult to do any better.

Instead, you should implement the look up by advancing to the next different character; which is much easier when manipulating a slice of bytes:

fn look(bytes: &[u8]) -> impl Iterator<Item = (u32, u8)> + '_ { struct Looker<'a> { bytes: &'a [u8], } impl<'a> Iterator for Looker<'a> { type Item = (u32, u8); fn next(&mut self) -> Option<Self::Item> { let (&first, tail) = self.bytes.split_first()?; let index = tail.iter() .position(|b| *b != first) .unwrap_or(tail.len()); self.bytes = tail.split_at_checked(index).unwrap_or((&[], &[])).1; Some(((index + 1) as u32, first)) } } Looker { bytes } }

Note: for the future reader, the generator feature (gen) will make this less boilerplatey in the future.

Many, many, allocations

The algorithm creates... many, many, strings.

Each call to look_and_say will:

Creates a new, empty, String, which will be reallocated log(N) times.
Creates a new, empty, String, which will be reallocated at least once or twice, for each call to to_roman.

That a lot of memory allocations.

It would make sense, instead, to tweak the signatures in order to enable buffer reuse.

let mut old = numeral.into_vec(); let mut new = Vec::new(); for _ in 0..n { q1::look_and_say(&old, &mut new); mem::swap(&mut old, &mut new); }

With the signatures changed to match:

pub fn look_and_say(old: &[u8], new: &mut Vec<u8>); pub fn to_roman(mut n: u32, out: &mut Vec<u8>);

Even better, both old and new could start pre-sized, if you have a good heuristic for the final size they could reach. A reasonable upper-bound -- which would guarantee the absence of reallocation -- would be great, though otherwise a conservative lower-bound triggering at most one reallocation would already be pretty good.

Repeated parsing/formatting

While the challenge is expressed as parsing a string and formatting a string at each round, an implementation doesn't, actually, need to do so.

Turning things on its head, one could instead:

Parse the original numeral into a sequence of (count, character) tuples.
Then, repeatedly, apply a look-and-say which produces a sequence of (count, character) from a sequence of (count, character).
Finally produce the result from such a sequence.

This would entirely skip the parsing & formatting codes, and likely result in performance gains.

The only trick, there, would be in the "formatting" part of look-and-say, to consolidate the last element of the sequence with the element to be appended, if they have the same character.

Thank you for your answer. Regarding your issue about standard input, that is the way that the question requests for the input to be taken, which is why I have done it like that here. — sbottingota
– sbottingota, Commented Oct 5, 2024 at 15:49
@sbottingota: I kinda suspected so, it's common for this kind of challenges, and CLI arguments are messy once you need larger inputs. — Matthieu M.
– Matthieu M., Commented Oct 6, 2024 at 11:45

Stack Exchange Network

Roman Numerals Look and Say in rust (BIO 2020 Q1)

1 Answer 1

Craftsmanship

No test

No documentation

No Linting

Non-Idiomatic item order (mild)

Strange use of standard input

Entangled

Nesting

Algorithmic

Repeated work in command line splitting

Pessimistic romanization

Performance

To String or not to String

One Byte by One Byte

Many, many, allocations

Repeated parsing/formatting

You must log in to answer this question.

Linked

Hot Network Questions

Roman Numerals Look and Say in rust (BIO 2020 Q1)

1 Answer 1

Craftsmanship

No test

No documentation

No Linting

Non-Idiomatic item order (mild)

Strange use of standard input

Entangled

Nesting

Algorithmic

Repeated work in command line splitting

Pessimistic romanization

Performance

To String or not to String

One Byte by One Byte

Many, many, allocations

Repeated parsing/formatting

You must log in to answer this question.

Linked

Related

Hot Network Questions