Helsinki Finite-State Technology / Bugs / #280 hfst-optimized-lookup -n removes analyses with identical weight

Flammie Pirinen - 2014-11-25

This looks like the expected behaviour of a typical n-best search (the underlying assumption for OpenFst weights is generally still that they are of the log-probability kind, where ties happen more rarely):

hfst-optimized-lookup -h
Usage: hfst-optimized-lookup [OPTIONS] TRANSDUCER
Run a transducer on standard input (one word per line) and print analyses
...
-n N, --analyses=N Output no more than N analyses
(if the transducer is weighted, the N best analyses)

The desired improvement here is the one we implemented (or tried to?) for hfst-proc earlier, namely an integer weight-class structure:

$ hfst-proc -h
Usage: hfst-proc [-a [-p|-C|-x] [-k]|-g|-n|-d|-t] [-W] [-n N] [-c|-w] [-z] [-v|-q|]
transducer_file [input_file [output_file]]
Perform a transducer lookup on a text stream, tokenizing on the fly
Transducer must be in HFST optimized lookup format
--weight-classes N Output no more than N best weight classes
(where analyses with equal weight constitute a class

Ideally, the integer-based penalty weight class would probably be its own transducer type.
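
To make the weight-class idea concrete, here is a minimal Python sketch of what "N best weight classes" could mean when applied to a list of already computed analyses. It is an illustration only, not the hfst-proc implementation: the function name `n_best_weight_classes` is hypothetical, weights are compared for exact equality (which suits integer penalty weights but not general log probabilities), and the sample data is the `mina` output from the original report.

```python
from itertools import groupby


def n_best_weight_classes(analyses, n):
    """Keep every analysis whose weight is among the n lowest distinct weights.

    `analyses` is a list of (analysis, weight) pairs. Analyses with equal
    weight form one class, so the result may contain more than n items.
    """
    # sorted() is stable, so the original order is kept inside each class.
    ranked = sorted(analyses, key=lambda pair: pair[1])
    kept = []
    classes = groupby(ranked, key=lambda pair: pair[1])
    for class_index, (_weight, group) in enumerate(classes):
        if class_index >= n:
            break
        kept.extend(group)
    return kept


# Sample data taken from the bug report (weights as floats).
analyses = [
    ("mi+N+Sg+Ess", 0.0),
    ("mina+N+Sg+Gen", 0.0),
    ("mina+N+Sg+Nom", 0.0),
    ("mina+N+Sg+Par", 0.0),
    ("mina+Pron+Sg+Nom", 0.0),
    ("mi+N+Sg+Gen#na+Adv", 5.0),
    ("mi+N+Sg+Nom#na+Adv", 5.0),
]

for analysis, weight in n_best_weight_classes(analyses, 1):
    print("mina", analysis, weight)
```

With one weight class, this keeps all five analyses of weight 0 and drops the two compound analyses of weight 5, which is the behaviour the report asks for.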
- Krister Lindén - 2014-11-25
  
  A more general way of doing this is to use a beam, i.e. everything
  within the beam from the best result is output. If you have weight
  classes, the beam is also an integer, but if you have general weights
  based on probabilities, you can see the beam as a confidence limit.
  --
  Krister
  
  On 26.11.2014 00:00, Flammie Pirinen wrote:
  
  [bugs:#280] http://sourceforge.net/p/hfst/bugs/280
  hfst-optimized-lookup -n removes analyses with identical weight
  
  Status: open
  Group: future
  Created: Tue Nov 25, 2014 09:59 AM UTC by sjurum
  Last Updated: Tue Nov 25, 2014 09:59 AM UTC
  Owner: nobody
  
  Steps to reproduce:
  
  1. Build an fst with an extra weight attached to a symbol:

  hfst-reweight -a 5 -S '#' < src/analyser-gt-desc.hfst | hfst-fst2fst -w > src/analyser-gt-desc.hfstwol
  
  (Estonian in the Giella infra was used when the bug was discovered)
  
  2. Analyse a word:
  
  $ echo mina | hfst-lookup -q -p src/analyser-gt-desc.hfstwol
  mina mi+N+Sg+Ess 0,000000
  mina mina+N+Sg+Gen 0,000000
  mina mina+N+Sg+Nom 0,000000
  mina mina+N+Sg+Par 0,000000
  mina mina+Pron+Sg+Nom 0,000000
  mina mi+N+Sg+Gen#na+Adv 5,000000
  mina mi+N+Sg+Nom#na+Adv 5,000000
  
  The intention is to remove all and only the analyses whose weight is
  higher than the N lowest weights.
  
  3. Try this:
  
  $ echo mina | hfst-optimized-lookup -q -n 1 src/analyser-gt-desc.hfstwol
  mina mi+N+Sg+Ess 0
  
  Then:
  
  $ echo mina | hfst-optimized-lookup -q -n 4 src/analyser-gt-desc.hfstwol
  mina mi+N+Sg+Ess 0
  mina mina+Pron+Sg+Nom 0
  mina mina+N+Sg+Gen 0
  mina mina+N+Sg+Par 0
  
  Conclusion:
  
  That is, even with a weighted transducer, -n is taken literally and only
  N analyses are returned, irrespective of weight. What I had expected
  with a weighted transducer is something like:
  
  Expected behavior:
  
  $ echo mina | hfst-optimized-lookup -q -n 1 src/analyser-gt-desc.hfstwol
  mina mi+N+Sg+Ess 0,000000
  mina mina+N+Sg+Gen 0,000000
  mina mina+N+Sg+Nom 0,000000
  mina mina+N+Sg+Par 0,000000
  mina mina+Pron+Sg+Nom 0,000000
  
  That is, with -n 1, the intention is to return all analyses with the
  least weight (the "first" weight, so to speak).
  
  The present behavior could be considered either a bug in -n or a feature.
  In the latter case, this bug report becomes a feature request.
  

Erik Axelson - 2015-01-30

Option --beam added to hfst-optimized-lookup in svn revision 4169. Option -n=N is still interpreted literally, i.e. no more than N analyses are returned. Option --beam=B (-b=B in short form) returns all analyses whose weight is within B from the best analysis (the analysis with the lowest weight). Options -n and --beam are combined with AND, so they both restrict the number of results. For example, for all possible weighted outputs:

foo 0.1
bar 0.1
baz 0.2
FOO 0.5
BAR 1.0
BAZ 1.3

we will get the following results with different options:

-n=5: foo, bar, baz, FOO, BAR (five lightest results)
-n=1: foo or bar (both have the lowest weight)
-b=0.3: foo, bar, baz (no weights bigger than 0.1 + 0.3 = 0.4 allowed)
-n=5 -b=0.3: foo, bar, baz (-b=0.3 restricts the results)
-n=1 -b=0.3: foo or bar (-n=1 restricts the results)

Options -n and -b are both applied after the actual lookup, so they only restrict the number of results shown. Ideally, they should be applied during the lookup itself if they are meant to make lookup faster.
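
The way -n and --beam combine can be sketched as a filter over the finished result list. The following Python is only an illustration of the semantics described in this comment, not the hfst-optimized-lookup source: the function name `restrict` and its parameters `n` and `beam` are made up for the example, and it assumes the results are available as (output, weight) pairs.

```python
def restrict(results, n=None, beam=None):
    """Apply the beam and n-best limits to (output, weight) pairs.

    Both limits are combined with AND: a result must be within `beam`
    of the lowest weight and also among the first `n` results.
    """
    # Lightest results first; sorted() is stable, so ties keep input order.
    ranked = sorted(results, key=lambda pair: pair[1])
    if beam is not None and ranked:
        limit = ranked[0][1] + beam  # best weight plus beam width
        ranked = [pair for pair in ranked if pair[1] <= limit]
    if n is not None:
        ranked = ranked[:n]  # taken literally: at most n results
    return ranked


# The weighted outputs from the example above.
results = [
    ("foo", 0.1),
    ("bar", 0.1),
    ("baz", 0.2),
    ("FOO", 0.5),
    ("BAR", 1.0),
    ("BAZ", 1.3),
]

for options in ({"n": 5}, {"n": 1}, {"beam": 0.3},
                {"n": 5, "beam": 0.3}, {"n": 1, "beam": 0.3}):
    print(options, [output for output, _ in restrict(results, **options)])
```

In this sketch, ties under -n=1 are broken by input order, so it returns foo; which of foo and bar the real tool returns is not specified above.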

Last edit: Erik Axelson 2015-01-30


Erik Axelson - 2015-04-08

status: open --> closed
