LINQ To Objects and Referential Equality

Take this snippet of code:

var strings = new List<String>()
    { "a", "b", "b", "c", "d" };

foreach (string s in strings.Distinct())
{
    Console.WriteLine(s);
}

What do you expect when you run it?

          a
          b
          c
          d

As expected, the Distinct() extension method returned an IEnumerable<string> with the extra b removed.

Now take this snippet:

public class Foo
{
    public Foo(string bar)
    {
        this.Bar = bar;
    }

    public string Bar
    {
        get;
        set;
    }
}

And...

var foos = new List<Foo>()
    { new Foo("a"), new Foo("b"), new Foo("b"), new Foo("c"), new Foo("d") };

foreach (Foo f in foos.Distinct())
{
    Console.WriteLine(f.Bar);
}

Running this should return the exact same result, right?

         a
         b
         b
         c
         d

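Alas, it did not. The reason is that Distinct() compares elements with the default equality comparer for T, which calls Equals() (and GetHashCode()) on each element.

String overrides Equals() to do a value check. A class that does not override it (such as Foo) inherits Object.Equals(), which does a referential check.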
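
To see the difference in isolation, here is a minimal sketch. The EqualityDemo class name is made up for illustration, and Foo is repeated only so the sample compiles on its own. Note that String is itself a class; it simply overrides Equals().

```csharp
using System;

// Same shape as the Foo class above: it overrides neither Equals() nor GetHashCode().
public class Foo
{
    public Foo(string bar) { Bar = bar; }
    public string Bar { get; set; }
}

// Hypothetical demo class, not part of the original post.
public static class EqualityDemo
{
    public static void Main()
    {
        // Build the second string at runtime so the two are separate objects.
        string s1 = "b";
        string s2 = new string('b', 1);

        Console.WriteLine(s1.Equals(s2));                  // True: String compares characters
        Console.WriteLine(object.ReferenceEquals(s1, s2)); // False: two separate objects

        Foo f1 = new Foo("b");
        Foo f2 = new Foo("b");

        Console.WriteLine(f1.Equals(f2)); // False: inherited Object.Equals() compares references
        Console.WriteLine(f1.Equals(f1)); // True: same reference
    }
}
```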

This can be surprising to people at first, especially when working with common objects like a DataRow. However, Distinct() has an overload that takes an IEqualityComparer<T>.

To use this overload, you either pass in one of several premade comparer classes that come with the .NET Framework, or you create your own comparer class and pass it in.
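
For reference, both options look roughly like this. It is only a sketch: FooBarComparer and ComparerDemo are names I made up for illustration, while StringComparer.OrdinalIgnoreCase is one of the comparers that ships with the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Foo
{
    public Foo(string bar) { Bar = bar; }
    public string Bar { get; set; }
}

// Hypothetical hand-written comparer: two Foos are equal when their Bar values match.
public class FooBarComparer : IEqualityComparer<Foo>
{
    public bool Equals(Foo x, Foo y)
    {
        if (object.ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Bar == y.Bar;
    }

    public int GetHashCode(Foo obj)
    {
        // Must agree with Equals(): equal Bars give equal hash codes.
        return obj.Bar == null ? 0 : obj.Bar.GetHashCode();
    }
}

public static class ComparerDemo
{
    public static void Main()
    {
        var foos = new List<Foo>
            { new Foo("a"), new Foo("b"), new Foo("b"), new Foo("c"), new Foo("d") };

        // Option 1: your own comparer class.
        Console.WriteLine(foos.Distinct(new FooBarComparer()).Count()); // 4

        // Option 2: a premade comparer, here a case-insensitive string comparer.
        var words = new List<string> { "a", "B", "b", "c" };
        Console.WriteLine(words.Distinct(StringComparer.OrdinalIgnoreCase).Count()); // 3
    }
}
```

It works, but that is a whole class just to say "compare by Bar".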

This was odd to me, because the concept of passing in a comparer class is very .NET 1.1 style, and I was surprised that you can't just pass in a compare function as a lambda.

So, I created an overload of Distinct() that takes a lambda as its argument.

Place these two classes in a file in your project:

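using System;
using System.Collections.Generic;
using System.Linq;

public static class Extensions
{
    // Distinct() overload that accepts an equality lambda instead of a comparer class
    public static IEnumerable<T> Distinct<T>(this IEnumerable<T> source, Func<T, T, bool> comparer)
    {
        return source.Distinct(new DelegateComparer<T>(comparer));
    }
}

// Adapts a lambda to the IEqualityComparer<T> interface that Distinct() expects
public class DelegateComparer<T> : IEqualityComparer<T>
{
    private Func<T, T, bool> _equals;

    public DelegateComparer(Func<T, T, bool> equals)
    {
        this._equals = equals;
    }

    public bool Equals(T a, T b)
    {
        return _equals(a, b);
    }

    public int GetHashCode(T a)
    {
        // Uses the object's own hash code
        return a.GetHashCode();
    }
}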

And now, we can call Distinct() and pass it a comparer as a lambda:

foos.Distinct((a,b) => (String.Compare(a.Bar,b.Bar) == 0))

Much better. Now the code is easy to read, and we don't have to create custom IEqualityComparer classes for all of our objects.

There is one caveat: Distinct() also uses GetHashCode(), and two elements are only treated as duplicates if their hash codes match as well. In this example, I simply overrode GetHashCode() in Foo to return Bar.GetHashCode().
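
As a rough sketch of what overriding on the class looks like, here is a variation of Foo that overrides both Equals() and GetHashCode(), so that even the plain Distinct() call works without any comparer. This goes a step further than what I did above (I only overrode GetHashCode()), and the HashDemo class name is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Foo
{
    public Foo(string bar) { Bar = bar; }
    public string Bar { get; set; }

    // Value-style equality: two Foos are equal when their Bar values match.
    public override bool Equals(object obj)
    {
        Foo other = obj as Foo;
        return other != null && Bar == other.Bar;
    }

    // Equal objects must return equal hash codes, so hash the same field Equals() compares.
    public override int GetHashCode()
    {
        return Bar == null ? 0 : Bar.GetHashCode();
    }
}

public static class HashDemo
{
    public static void Main()
    {
        var foos = new List<Foo>
            { new Foo("a"), new Foo("b"), new Foo("b"), new Foo("c"), new Foo("d") };

        Console.WriteLine(foos.Distinct().Count()); // 4
    }
}
```

The downside is that this bakes one definition of equality into the class itself.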

However, for more flexibility, we can modify DelegateComparer to take a second lambda for the hash method.

var distinctFoos =
    foos.Distinct
    (
        (a, b) => (String.Compare(a.Bar, b.Bar) == 0),
        (a) => a.Bar.GetHashCode()
    );

Now that is a thing of beauty. The purpose of the method is easy to understand, and it is infinitely flexible.

The final version of the extension method and IEqualityComparer is below:

using System;
using System.Collections.Generic;
using System.Linq;

public static class Extensions
{
    public static IEnumerable<T> Distinct<T>(this IEnumerable<T> source, Func<T, T, bool> comparer)
    {
        return source.Distinct(new DelegateComparer<T>(comparer));
    }

    public static IEnumerable<T> Distinct<T>(this IEnumerable<T> source, Func<T, T, bool> comparer, Func<T, int> hashMethod)
    {
        return source.Distinct(new DelegateComparer<T>(comparer, hashMethod));
    }
}

public class DelegateComparer<T> : IEqualityComparer<T>
{
    private Func<T, T, bool> _equals;
    private Func<T, int> _getHashCode;

    public DelegateComparer(Func<T, T, bool> equals)
    {
        this._equals = equals;
    }

    public DelegateComparer(Func<T, T, bool> equals, Func<T, int> getHashCode)
    {
        this._equals = equals;
        this._getHashCode = getHashCode;
    }

    public bool Equals(T a, T b)
    {
        return _equals(a, b);
    }

    public int GetHashCode(T a)
    {
        // Use the supplied hash lambda if there is one; otherwise use the object's own hash code
        if (_getHashCode != null)
            return _getHashCode(a);
        else
            return a.GetHashCode();
    }
}
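
Here is a sketch of the two-lambda overload keyed on more than one field. Person, MultiFieldDemo and DistinctByName are hypothetical names invented for this example, and the comparer is a condensed copy of the one above so that the sample compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type with more than one field worth comparing.
public class Person
{
    public string First { get; set; }
    public string Last { get; set; }
    public int Age { get; set; }
}

// Condensed copy of the comparer above.
public class DelegateComparer<T> : IEqualityComparer<T>
{
    private readonly Func<T, T, bool> _equals;
    private readonly Func<T, int> _getHashCode;

    public DelegateComparer(Func<T, T, bool> equals, Func<T, int> getHashCode)
    {
        _equals = equals;
        _getHashCode = getHashCode;
    }

    public bool Equals(T a, T b) { return _equals(a, b); }
    public int GetHashCode(T a) { return _getHashCode(a); }
}

public static class Extensions
{
    public static IEnumerable<T> Distinct<T>(this IEnumerable<T> source,
        Func<T, T, bool> comparer, Func<T, int> hashMethod)
    {
        return source.Distinct(new DelegateComparer<T>(comparer, hashMethod));
    }
}

public static class MultiFieldDemo
{
    // Treats people as duplicates when both names match; Age is ignored.
    public static List<Person> DistinctByName(IEnumerable<Person> people)
    {
        return people.Distinct(
            (a, b) => a.First == b.First && a.Last == b.Last,
            a => (a.First + "|" + a.Last).GetHashCode() // hash the same fields that are compared
        ).ToList();
    }

    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { First = "Ada", Last = "Lovelace", Age = 36 },
            new Person { First = "Ada", Last = "Lovelace", Age = 99 },
            new Person { First = "Alan", Last = "Turing", Age = 41 }
        };

        Console.WriteLine(DistinctByName(people).Count); // 2
    }
}
```

The hash lambda has to be built from the same fields the equality lambda compares, otherwise equal objects can land in different buckets and never get compared.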

Drop me a comment if you found these snippets useful.

Posted by Jonathan Holland on 1/28/2009.

Tags: .NET   Tips-and-Tricks   LINQ

Comments:

Why not just
foreach (string s in foos.Select (f => f.Bar).Distinct ())
?

Posted by jason on 1/29/2009.

@Jason

Your example does not do the same thing as mine. If I wanted to iterate the list of strings in Foo, then your example is fine. However, in reality, Foo would be a complex object, and we might want to compare it by keying it on several fields...and then use other fields...

Just grabbing one property out of it is an oversimplification.

Additionally, chaining a Select() and a Distinct() puts two operators in the pipeline, although LINQ's deferred execution means the source list is still only walked once.

Posted by Jonathan Hollland on 1/29/2009.

Alright Jon, give us some material on LINQ to SQL.

Real quick example:

we can take the Distinct() and LTS investigation a bit further...Your example uses one set of "foos" (locally) but what about an entity relationship in some external data source?

 var p =
   (from em in db.Employees
       join ehr in db.Stuffs
       on em.HireDate equals ehr.StartDate
       select new
       { em.EmployeeID, Date = ehr.StartDate}
   ).ToList().Distinct();

Now we know that .Distinct() is client side... as evidenced by the generated SQL

  select [t0].[EmployeeID], [t1].[StartDate]
     from [HR].[Employee] as [t0]
     inner join [HR].[Stuff] as [t1]
     on [t0].[HireDate] = [t1].[StartDate]

Now instead of using .Distinct() we can take advantage of the relationships we have and modify the query slightly

 var p =
   (from em in db.Employees
       join ehr in db.Stuffs
       on em.HireDate equals ehr.StartDate
       where em.EmployeeID == ehr.EmployeeID
       select new
       {
          em.EmployeeID,
          Date = ehr.StartDate
       }
   ).ToList();

This method is much more efficient because it will not return unnecessary records from the SQL db just to be filtered out by C# with .Distinct() ... as we can see from the generated SQL:

  select [t0].[EmployeeID], [t1].[StartDate]
      from [HR].[Employee] as [t0]
      inner join [HR].[Stuff] as [t1]
      on [t0].[HireDate] = [t1].[StartDate]
      where [t0].[EmployeeID] = [t1].[EmployeeID]

If you want all results from your join, LTS gives you another approach: the "group join"

  var q =
     (from em in db.Employees
         join ehr in db.Stuffs
         on em.HireDate equals ehr.StartDate into allEm
         select new {em, allEm }
     ).ToList();

which generates a rather long left outer join statement.

Posted by infoe on 1/29/2009.

var p = (from em in db.Employees join ehr in db.Stuffs on em.HireDate equals ehr.StartDate select new { em.EmployeeID, Date = ehr.StartDate} ).ToList().Distinct();

I'm pretty sure that the reason the SQL generated didn't include "distinct" is because you called "ToList()" right before it. This causes the query to be executed immediately.

the following will produce the same result, but will run distinct at the SQL server before the set is returned.

var p = (from em in db.Employees join ehr in db.Stuffs on em.HireDate equals ehr.StartDate select new { em.EmployeeID, Date = ehr.StartDate} ).Distinct().ToList();

the reason for this is covered extensively in any LINQ book, and probably also on MSDN.

Posted by Andrew Theken on 2/2/2009.

@Andrew Theken

Thanks for the clarification!

In fact, the reason is not covered extensively "in any" LINQ book... I feel I've been robbed since the book I bought has misled me :)

ISBN-13 (electronic): 978-1-4302-0597-5, pp. 99-100

"The reason for this is because the .Distinct() LINQ operator is handling the filtering on the client (that is, C#/framework) versus sending the command to the database. Because the join syntax is being translated into quality SQL, one method of "fixing" the SQL is to add another clause into the LINQ query"

Posted by infoe on 2/2/2009.

@infoe

this is a good page for understanding what's happening with the two queries:

Query Concepts in LINQ to SQL

and, more specifically, this article. Remote vs. Local Query Execution (LINQ to SQL)

Posted by Andrew Theken on 2/2/2009.
