Steps to reproduce:
1. make a weighted transducer:
$ hfst-reweight -a 5 -S '#' < src/analyser-gt-desc.hfst | hfst-fst2fst -w > src/analyser-gt-desc.hfstwol
(Estonian in the Giella infra was used when the bug was discovered)
2. analyse a word:
$ echo mina | hfst-lookup -q -p src/analyser-gt-desc.hfstwol
mina mi+N+Sg+Ess 0,000000
mina mina+N+Sg+Gen 0,000000
mina mina+N+Sg+Nom 0,000000
mina mina+N+Sg+Par 0,000000
mina mina+Pron+Sg+Nom 0,000000
mina mi+N+Sg+Gen#na+Adv 5,000000
mina mi+N+Sg+Nom#na+Adv 5,000000
The intention is to remove all and only the analyses whose weight is higher than the N lowest weights.
3. try this:
$ echo mina | hfst-optimized-lookup -q -n 1 src/analyser-gt-desc.hfstwol
mina mi+N+Sg+Ess 0
Then:
$ echo mina | hfst-optimized-lookup -q -n 4 src/analyser-gt-desc.hfstwol
mina mi+N+Sg+Ess 0
mina mina+Pron+Sg+Nom 0
mina mina+N+Sg+Gen 0
mina mina+N+Sg+Par 0
Conclusion:
That is, even with a weighted transducer, -n is taken literally: only N analyses are returned, irrespective of weight. What I had expected with a weighted transducer is something like:
Expected behavior:
$ echo mina | hfst-optimized-lookup -q -n 1 src/analyser-gt-desc.hfstwol
mina mi+N+Sg+Ess 0,000000
mina mina+N+Sg+Gen 0,000000
mina mina+N+Sg+Nom 0,000000
mina mina+N+Sg+Par 0,000000
mina mina+Pron+Sg+Nom 0,000000
That is, with -n 1, the intention is to return all analyses with the lowest weight (the "first" weight, so to speak).
The present behavior could be considered either a bug in -n or a feature. In the latter case, this bug report becomes a feature request.
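The requested semantics can be sketched roughly as follows. This is only an illustration in Python, not HFST code: the function name n_best_weight_classes and the list-of-pairs input are assumptions made for the example, and weights are compared exactly, which presumes penalty-style weight classes rather than arbitrary floating-point weights.

```python
def n_best_weight_classes(analyses, n):
    """Keep all analyses whose weight is among the n lowest distinct weights.

    `analyses` is a list of (analysis, weight) pairs. Hypothetical helper,
    not part of HFST.
    """
    if n <= 0:
        return []
    # The n lowest distinct weights define the weight classes to keep.
    kept_weights = set(sorted({w for _, w in analyses})[:n])
    # sorted() is stable, so the original order survives inside each class.
    return sorted((a for a in analyses if a[1] in kept_weights),
                  key=lambda a: a[1])


# The analyses of "mina" from step 2, with their weights.
analyses = [
    ("mi+N+Sg+Ess", 0.0),
    ("mina+N+Sg+Gen", 0.0),
    ("mina+N+Sg+Nom", 0.0),
    ("mina+N+Sg+Par", 0.0),
    ("mina+Pron+Sg+Nom", 0.0),
    ("mi+N+Sg+Gen#na+Adv", 5.0),
    ("mi+N+Sg+Nom#na+Adv", 5.0),
]

# With n = 1, only the five weight-0 analyses are printed.
for analysis, weight in n_best_weight_classes(analyses, 1):
    print("mina", analysis, weight)
```

With n = 2 the two compound analyses with weight 5 would be included as well, since they form the second weight class.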
This seems like the expected behaviour of a typical n-best search (the underlying assumption for OpenFst weights is generally still that they are log probabilities, where ties like this happen more rarely).
The desired improvement here is the one we implemented (or tried to?) for hfst-proc earlier, namely an integer weight class structure:
Ideally, the int-based penalty weight class would probably be its own transducer type.
A more general way of doing this is to use a beam, i.e. everything
within the beam from the best result is output. If you have weight
classes, the beam is also an integer, but if you have general weights
based on probabilities, you can see the beam as a confidence limit.
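A rough sketch of the beam idea, in Python rather than HFST code. The names within_beam and beam_as_probability_ratio are made up for the illustration, and the second function assumes that weights are negative natural-log probabilities, which is not stated in this thread.

```python
import math


def within_beam(results, beam):
    """Keep results whose weight is at most `beam` above the best weight.

    `results` is a list of (string, weight) pairs. Hypothetical helper.
    """
    if not results:
        return []
    best = min(w for _, w in results)
    return [(s, w) for s, w in results if w - best <= beam]


def beam_as_probability_ratio(beam):
    """Assuming weights are -ln(p): a beam B keeps p >= p_best * exp(-B)."""
    return math.exp(-beam)


# Integer weight classes: the beam is simply a number of classes to allow.
results = [("a", 0), ("b", 0), ("c", 1), ("d", 5)]
print(within_beam(results, 0))  # only the best weight class
print(within_beam(results, 1))  # best class plus anything one penalty away

# Probability-style weights: a beam of ln(2) keeps every result that is
# at least about half as probable as the best one.
print(beam_as_probability_ratio(math.log(2)))
```

Under that assumption the beam acts as a threshold relative to the best result, rather than as a fixed count of results.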
--
Krister
On 26.11.2014 00:00, Flammie Pirinen wrote:
Related Bugs: #280
Option --beam was added to hfst-optimized-lookup in svn revision 4169. Option -n=N is still interpreted literally, i.e. no more than N analyses are returned. Option --beam=B (-b=B in short form) returns all analyses whose weight is within B of the best analysis (the analysis with the lowest weight). Options -n and --beam are combined with AND, so both of them restrict the number of results. For example, given the following weighted outputs:
foo 0.1
bar 0.1
baz 0.2
FOO 0.5
BAR 1.0
BAZ 1.3
we will get the following results with different options:
-n=5: foo, bar, baz, FOO, BAR (five lightest results)
-n=1: foo or bar (both have the lowest weight)
-b=0.3: foo, bar, baz (no weights bigger than 0.1 + 0.3 = 0.4 allowed)
-n=5 -b=0.3: foo, bar, baz (-b=0.3 restricts the results)
-n=1 -b=0.3: foo or bar (-n=1 restricts the results)
Options -n and -b are both applied after the actual lookup, to restrict the number of results shown. If they are meant to make lookup faster, they should ideally be applied during the lookup itself.
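The combination of the two options described above can be illustrated with a small post-filter. This is a sketch in Python under the stated semantics, not the actual hfst-optimized-lookup implementation; the function name restrict and its keyword arguments are invented for the example.

```python
def restrict(results, n=None, beam=None):
    """Post-filter lookup results by count (n) AND by beam.

    `results` is a list of (output, weight) pairs. Hypothetical helper
    that mimics the described behaviour of -n and --beam.
    """
    # Lightest results first; the sort is stable, so ties keep input order.
    ranked = sorted(results, key=lambda r: r[1])
    if beam is not None and ranked:
        best = ranked[0][1]
        # Drop everything heavier than best + beam.
        ranked = [r for r in ranked if r[1] - best <= beam]
    if n is not None:
        # -n is literal: never more than n results, ties or not.
        ranked = ranked[:n]
    return ranked


outputs = [("foo", 0.1), ("bar", 0.1), ("baz", 0.2),
           ("FOO", 0.5), ("BAR", 1.0), ("BAZ", 1.3)]

for n, beam in [(5, None), (1, None), (None, 0.3), (5, 0.3), (1, 0.3)]:
    names = [name for name, _ in restrict(outputs, n=n, beam=beam)]
    print(f"n={n} beam={beam}: {', '.join(names)}")
```

Because the sort is stable, this sketch always picks foo for -n=1, whereas the real tool may return either foo or bar.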
Last edit: Erik Axelson 2015-01-30