
#280 hfst-optimized-lookup -n removes analyses with identical weight

Milestone: future
Status: closed
Owner: nobody
Labels: None
Priority: 1
Updated: 2015-04-08
Created: 2014-11-25
Creator: sjurum
Private: No

Steps to reproduce:

  1. Build an fst with an extra weight attached to a symbol:

hfst-reweight -a 5 -S '#' < src/analyser-gt-desc.hfst | hfst-fst2fst -w > src/analyser-gt-desc.hfstwol

(Estonian in the Giella infra was used when the bug was discovered)

  2. Analyse a word:

$ echo mina | hfst-lookup -q -p src/analyser-gt-desc.hfstwol
mina mi+N+Sg+Ess 0,000000
mina mina+N+Sg+Gen 0,000000
mina mina+N+Sg+Nom 0,000000
mina mina+N+Sg+Par 0,000000
mina mina+Pron+Sg+Nom 0,000000
mina mi+N+Sg+Gen#na+Adv 5,000000
mina mi+N+Sg+Nom#na+Adv 5,000000

The intention is to remove all and only the analyses whose weight is higher than the N lowest weights.

  3. Try this:

$ echo mina | hfst-optimized-lookup -q -n 1 src/analyser-gt-desc.hfstwol
mina mi+N+Sg+Ess 0

Then:

$ echo mina | hfst-optimized-lookup -q -n 4 src/analyser-gt-desc.hfstwol
mina mi+N+Sg+Ess 0
mina mina+Pron+Sg+Nom 0
mina mina+N+Sg+Gen 0
mina mina+N+Sg+Par 0

Conclusion:

That is, even with a weighted transducer, -n is taken literally and at most N analyses are returned, irrespective of weight. What I had expected with a weighted transducer is something like:

Expected behavior:

$ echo mina | hfst-optimized-lookup -q -n 1 src/analyser-gt-desc.hfstwol
mina mi+N+Sg+Ess 0,000000
mina mina+N+Sg+Gen 0,000000
mina mina+N+Sg+Nom 0,000000
mina mina+N+Sg+Par 0,000000
mina mina+Pron+Sg+Nom 0,000000

That is, with -n 1, the intention is to return all analyses with the least weight (the "first" weight, so to speak).
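Until the tool itself supports this, the requested behaviour can be approximated by post-filtering the full lookup output. The following is a minimal sketch, not part of HFST: `best_weight_only` is a hypothetical helper name, and it assumes one whitespace-separated "input analysis weight" triple per line, a single input word per stream, and a weight in the last column (a decimal comma is normalised to a dot).

```shell
# Hypothetical helper (not an HFST tool): keep only the analyses that
# share the lowest weight found in the stream.
# Assumptions: one "input analysis weight" triple per line, weight last,
# and a single input word per invocation.
best_weight_only() {
  awk '
    NF >= 3 {
      w = $NF
      sub(",", ".", w)      # 0,000000 -> 0.000000
      w += 0                # force a numeric value
      line[++n] = $0
      weight[n] = w
      if (n == 1 || w < min) min = w
    }
    END {
      # Print, in input order, every line whose weight equals the minimum.
      for (i = 1; i <= n; i++)
        if (weight[i] == min) print line[i]
    }
  '
}
```

It would be used as `echo mina | hfst-lookup -q -p src/analyser-gt-desc.hfstwol | best_weight_only`. Blank lines in the lookup output are skipped because they have fewer than three fields.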

The present behavior could be considered either a bug in -n or a feature. In the latter case, this bug report becomes a feature request.


Discussion

  • Flammie Pirinen

    Flammie Pirinen - 2014-11-25

    This looks like the expected behaviour of typical n-best lookup. (The underlying assumption for OpenFst weights is generally still that they are log probabilities, where ties like this happen more rarely.)

    hfst-optimized-lookup -h
    Usage: hfst-optimized-lookup [OPTIONS] TRANSDUCER
    Run a transducer on standard input (one word per line) and print analyses
    ...
    -n N, --analyses=N Output no more than N analyses
    (if the transducer is weighted, the N best analyses)

    The desired improvement here is the one we implemented (or tried to?) earlier for hfst-proc, namely an integer weight class structure:

    $ hfst-proc -h
    Usage: hfst-proc [-a [-p|-C|-x] [-k]|-g|-n|-d|-t] [-W] [-n N] [-c|-w] [-z] [-v|-q|]
    transducer_file [input_file [output_file]]
    Perform a transducer lookup on a text stream, tokenizing on the fly
    Transducer must be in HFST optimized lookup format
    --weight-classes N Output no more than N best weight classes
    (where analyses with equal weight constitute a class

    Ideally, the integer-based penalty weight class would probably be its own transducer type.
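To make the weight-class idea concrete, here is a rough sketch of how "no more than N best weight classes" could be computed over already-produced lookup output. It is an illustration only and says nothing about how hfst-proc implements `--weight-classes`; `best_weight_classes` is a hypothetical name, and the input is assumed to be whitespace-separated lines with the weight in the last column, for a single input word.

```shell
# Hypothetical helper (not an HFST tool): keep the analyses that belong
# to the N lowest distinct weights, where equal weights form one class.
# usage: best_weight_classes N
best_weight_classes() {
  awk -v classes="$1" '
    NF >= 3 {
      w = $NF; sub(",", ".", w); w += 0   # normalise decimal comma
      line[++n] = $0; weight[n] = w
    }
    END {
      have = 0
      # Each pass finds the next distinct weight above the current cutoff.
      for (c = 1; c <= classes; c++) {
        found = 0
        for (i = 1; i <= n; i++)
          if ((!have || weight[i] > cutoff) && (!found || weight[i] < cand)) {
            cand = weight[i]; found = 1
          }
        if (!found) break                 # fewer than N classes exist
        cutoff = cand; have = 1
      }
      if (have)
        for (i = 1; i <= n; i++)
          if (weight[i] <= cutoff) print line[i]
    }
  '
}
```

With N = 1 this returns every analysis that ties for the lowest weight, which is the behaviour the report asks for.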

     
    • Krister Lindén

      Krister Lindén - 2014-11-25

      A more general way of doing this is to use a beam, i.e. everything
      within the beam from the best result is output. If you have weight
      classes, the beam is also an integer, but if you have general weights
      based on probabilities, you can see the beam as a confidence limit.
      --
      Krister


  • Erik Axelson

    Erik Axelson - 2015-01-30

    Option --beam was added to hfst-optimized-lookup in svn revision 4169. Option -n=N is still interpreted literally, i.e. no more than N analyses are returned. Option --beam=B (-b=B in short form) returns all analyses whose weight is within B of the best analysis (the analysis with the lowest weight). Options -n and --beam are combined with AND, so they both restrict the number of results. For example, if the complete set of weighted outputs is:

    foo 0.1
    bar 0.1
    baz 0.2
    FOO 0.5
    BAR 1.0
    BAZ 1.3

    we will get the following results with different options:

    -n=5: foo, bar, baz, FOO, BAR (five lightest results)
    -n=1: foo or bar (both have the lowest weight)
    -b=0.3: foo, bar, baz (no weights bigger than 0.1 + 0.3 = 0.4 allowed)
    -n=5 -b=0.3: foo, bar, baz (-b=0.3 restricts the results)
    -n=1 -b=0.3: foo or bar (-n=1 restricts the results)

    Options -n and -b are both applied after the actual lookup, to restrict the number of results shown. If they are also meant to make lookup faster, they should ideally be applied during the lookup itself.
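The combined semantics described above (sort by weight, keep at most N results, and drop anything heavier than the best weight plus B) can be mimicked on the example data with a short filter. This is only a sketch of the described behaviour, not the code from revision 4169; `nbest_with_beam` is a hypothetical name, and the input is assumed to be two whitespace-separated columns, "output weight", with a dot as the decimal separator.

```shell
# Hypothetical helper (not an HFST tool): emulate -n=N combined with -b=B
# on lines of the form "output weight".
# usage: nbest_with_beam N BEAM
nbest_with_beam() {
  # Sort by weight, lightest first. Ties are broken by plain byte order,
  # so which of several equally light results comes first is arbitrary
  # with respect to the original order.
  LC_ALL=C sort -k2,2n | awk -v n="$1" -v beam="$2" '
    NR == 1 { best = $NF + 0 }          # lowest weight after sorting
    NR <= n && ($NF + 0) <= best + beam { print }
  '
}
```

For instance, feeding the six example lines to `nbest_with_beam 5 0.3` keeps only foo, bar and baz, while `nbest_with_beam 1 0.3` keeps a single one of the two lightest results. Weights that sit exactly on the beam boundary are subject to floating-point rounding in this sketch.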

     

    Last edit: Erik Axelson 2015-01-30
  • Erik Axelson

    Erik Axelson - 2015-04-08
    • status: open --> closed
     