NAME

   Set::Similarity - similarity measures for sets

SYNOPSIS

    use Set::Similarity::Dice;

    # object method
    my $dice = Set::Similarity::Dice->new;
    my $similarity = $dice->similarity('Photographer','Fotograf');

    # class method
    my $dice = 'Set::Similarity::Dice';
    my $similarity = $dice->similarity('Photographer','Fotograf');

    # from 2-grams
    my $width = 2;
    my $similarity = $dice->similarity('Photographer','Fotograf',$width);

    # from arrayref of tokens
    my $similarity = $dice->similarity(['a','b'],['b']);

    # from hashref of features
    my $bird = {
      wings    => true,
      eyes     => true,
      feathers => true,
      hairs    => false,
      legs     => true,
      arms     => false,
    };
    my $mammal = {
      wings    => false,
      eyes     => true,
      feathers => false,
      hairs    => true,
      legs     => true,
      arms     => true,
    };
    my $similarity = $dice->similarity($bird,$mammal);

    # from arrayref sets
    my $bird = [qw(
      wings
      eyes
      feathers
      legs
    )];
    my $mammal = [qw(
      eyes
      hairs
      legs
      arms
    )];
    my $similarity = $dice->from_sets($bird,$mammal);

DESCRIPTION

   This is the base class including mainly helper and convenience methods.

Overlap coefficient

   ( A intersect B ) / min(A,B)

Jaccard Index

   The Jaccard coefficient measures similarity between sample sets, and is
   defined as the size of the intersection divided by the size of the
   union of the sample sets

   ( A intersect B ) / (A union B)

   The Tanimoto coefficient is the ratio of the number of features common
   to both sets to the total number of features, i.e.

   ( A intersect B ) / ( A + B - ( A intersect B ) ) # the same as Jaccard

   The range is 0 to 1 inclusive.

Dice coefficient

   The Dice coefficient is the number of features in common to both sets
   relative to the average size of the total number of features present,
   i.e.

   ( A intersect B ) / 0.5 ( A + B ) # the same as sorensen

   The weighting factor comes from the 0.5 in the denominator. The range
   is 0 to 1.

METHODS

   All methods can be used as class or object methods.

new

     $object = Set::Similarity->new();

similarity

     my $similarity = $object->similarity($any1,$any1,$width);


   $any can be an arrayref, a hashref or a string. Strings are tokenized
   into n-grams of width $width.

   $width must be integer, or defaults to 1.

from_tokens

     my $similarity = $object->from_tokens(['a','b'],['b']);

from_sets

     my $similarity = $object->from_sets(['a'],['b']);


   Croaks if called directly. This method should be implemented in a child
   module.

intersection

     my $intersection_size = $object->intersection(['a'],['b']);


uniq

     my @uniq = $object->uniq(['a','b']);


   Transforms an arrayref of strings into an array of unique elements.

combined_length

     my $set_size_sum = $object->combined_length(['a'],['b']);

min

     my $min_set_size = $object->min(['a'],['b']);


ngrams

     my @monograms = $object->ngrams('abc');
     my @bigrams = $object->ngrams('abc',2);

_any

     my $arrayref = $object->_any($any,$width);


SEE ALSO

   Set::Similarity::Cosine

   Set::Similarity::Dice

   Set::Similarity::Jaccard

   Set::Similarity::Overlap

   Bag::Similarity doing the same for bags or multisets.

   Text::Levenshtein for distance measures of strings, and a very overview
   of similar modules,

   http://en.wikipedia.org/wiki/String_metric for an overview of
   similarity measures.

   Cluster::Similarity for clusters.

SOURCE REPOSITORY

   http://github.com/wollmers/Set-Similarity

AUTHOR

   Helmut Wollmersdorfer, <[email protected]>

COPYRIGHT AND LICENSE

   Copyright (C) 2013-2015 by Helmut Wollmersdorfer

   This library is free software; you can redistribute it and/or modify it
   under the same terms as Perl itself.