.NET provides a lot of tools for working with collections: IEnumerable, IQueryable and LINQ. One thing that people often don't realise about LINQ's query methods is that they return IEnumerables that are lazily evaluated.
For example, consider the following code:
using System;
using System.Collections.Generic;
using System.Linq;

namespace Lazy
{
    class Program
    {
        static void Main(string[] args)
        {
            var collection = (new List<int> { 1, 2, 3, 4 }).Select(x =>
            {
                Console.WriteLine(x);
                return x;
            });
            Console.WriteLine("Nothing has been evaluated yet...");
            collection.First();
            Console.WriteLine("Only the first item has been evaluated");
        }
    }
}
As the logging strings imply, the lambda we pass to .Select() doesn't run when .Select() is called on the collection variable. The output we see is:
Nothing has been evaluated yet...
1
Only the first item has been evaluated
So, 1 was only evaluated when we called .First(), and 2, 3 and 4 still haven't been evaluated at all, because we haven't asked for them.
On the other hand, if we'd called collection.ToList(), we'd suddenly see all the output appear.
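For instance, continuing the example above ('everything' is just a name I've picked):
var everything = collection.ToList(); // runs the Select lambda for all four items immediately
// Output: 1, 2, 3, 4. The evaluated items are now stored in memory in 'everything'.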
This highlights the difference between IList and IEnumerable: the items in an IList (a List<int>, say) are all stored in memory as soon as the list is created, but there's no such guarantee with IEnumerable. An IEnumerable is really just an iterator over a sequence, and exactly how and when the elements in the sequence are constructed is not defined by IEnumerable.
The lazy nature of LINQ often surprises developers when they first encounter it, because they've usually been using LINQ for a while without noticing. Most of the time it doesn't actually matter, which is why it takes time to notice, but it can cause elementary performance analysis to go a little awry when somebody thinks a LINQ query is running nice and fast... because it hasn't actually been evaluated yet. It can also cause issues when the evaluation of the sequence relies on some resource that has since been cleaned up, like a database session that has been closed.
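Here's a minimal sketch of that second trap, using a disposed StreamReader to stand in for a closed database session (the file name is hypothetical):
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class DisposedReaderDemo
{
    static void Main()
    {
        IEnumerable<string> lines;
        using (var reader = new StreamReader("some-file.txt")) // hypothetical file
        {
            // Nothing is read here; this just builds the deferred query.
            lines = Enumerable.Range(0, 3).Select(_ => reader.ReadLine());
        }
        // The reader was disposed when the using block ended, so enumerating
        // now throws ObjectDisposedException: the same failure mode as a
        // closed database session.
        foreach (var line in lines)
        {
            Console.WriteLine(line);
        }
    }
}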
Let's look at the lazy evaluation idea a bit more closely.
We can use the laziness of IEnumerable to improve our own code's efficiency. Let's say you want to write a parser for a simple little text format which has values separated by commas (let's pretend for the sake of example that this is really novel and no existing parsers exist).
You don't really want to load a potentially huge text file into memory all at once; you just want to read the file line by line, emit records as you see them, and let the caller consume each record before asking for the next. You could implement your own class, stream the file, and maybe have a bool MoveNext() method and a T Current property, and then mess around calling it with a while loop, but that's a bit cumbersome when C# will let you do the same thing with an IEnumerable.
As an aside, that's literally what an IEnumerator is. An IEnumerable exposes an IEnumerator via its GetEnumerator() method, and foreach (var item in collection) is (roughly) a syntactic shortcut for:
var enumerator = collection.GetEnumerator();
while (enumerator.MoveNext())
{
    var item = enumerator.Current;
    // foreach body here
}
(The compiler also wraps this in a try/finally that disposes the enumerator where applicable.)
Back to our not-CSV parser: to use an IEnumerable in this situation we just need to declare a method with a return type of IEnumerable and then use the yield return construct to populate it. yield return is a bit odd because it pauses the function there and gives (returns) the next item in the sequence, but the function isn't finished. The next time an item is requested, the function starts up again and continues where it left off. If you've used other languages like Python or JavaScript you might have heard this concept referred to as a 'generator'.
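A minimal sketch of that pause-and-resume behaviour (assuming the usual System and System.Collections.Generic usings):
static IEnumerable<int> Numbers()
{
    Console.WriteLine("Starting");
    yield return 1; // execution pauses here until the next item is requested
    Console.WriteLine("Resumed");
    yield return 2;
}
// foreach (var n in Numbers()) Console.WriteLine($"Got {n}");
// Output: Starting, Got 1, Resumed, Got 2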
It sounds a bit strange but it's quite intuitive, and our not-CSV parser looks something like this:
static IEnumerable<IDictionary<string, string>> Parse(string path)
{
    var lines = File.ReadLines(path); // This is also a lazily loaded IEnumerable!
    if (!lines.Any())
    {
        // Terminate the sequence with an empty return
        yield break;
    }
    var headers = lines.First().Split(',');
    foreach (var line in lines.Skip(1)) // oops - see caveat
    {
        var fields = line.Split(',');
        var pairs = headers.Zip(fields, (header, value) => new { header, value });
        yield return pairs.ToDictionary(x => x.header, x => x.value);
    }
}
Usage looks like this:
IEnumerable<IDictionary<string, string>> parsed = Parse("my-csv.csv");
foreach (var item in parsed)
{
    Console.WriteLine($"Name = {item["Name"]}, Home = {item["Home"]}");
}
/* CSV =
Name,Home
James Holden,Earth
Naomi Nagata,Ceres
Amos Burton,Earth
Alex Kamal,Mars

Output =
Name = James Holden, Home = Earth
Name = Naomi Nagata, Home = Ceres
Name = Amos Burton, Home = Earth
Name = Alex Kamal, Home = Mars
*/
There are a few noteworthy things here.
The File.ReadLines() call is doing a lot of heavy lifting for us, but, helpfully, it returns an IEnumerable<string> of the lines in the file. We can iterate over it in the same way that the caller can iterate over our method's results, and it'll only be holding one line in memory at any time. We're setting up a chain of IEnumerables, which is fine, and really useful.
Secondly, for convenience I used the .Skip() IEnumerable method to jump past the header line. If you think about it, Skip is an odd thing to use on a lazily evaluated sequence, because you're asking to jump to a point that hasn't been produced yet and might not even exist, and the only way to get there is to iterate through all the items until you reach the right point. This doesn't sound very efficient. In this case it's only skipping the first element so it barely matters, but IEnumerable's Skip can cause performance problems when used arbitrarily because it's an iteration operation in its own right. (That's also the caveat flagged in the code above: .Any(), .First() and .Skip() each start a fresh enumeration of lines, so the file is actually opened and its first line read several times.)
Third, the other thing that might look a bit odd is the .Zip() method. Zip pairs up the elements of two sequences positionally, stopping when the shorter sequence runs out. It's not used that often, but it is very useful on occasion and saves having to deal with explicit for loops and matching up indices from one list to another.
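For instance, a quick sketch of what Zip does with one of our rows:
var headers = new[] { "Name", "Home" };
var fields = new[] { "James Holden", "Earth" };
var pairs = headers.Zip(fields, (header, value) => $"{header} = {value}");
// pairs: "Name = James Holden", "Home = Earth"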
So there we have it: the world's least reliable CSV parser, but one which demonstrates the idea of yielding its records on demand rather than all at once.
Often when dealing with sequences and collections, either the collection in question is small or you need all of its elements right now, so whether or not it's generated lazily is a bit irrelevant. Where the laziness idea starts to become very useful is when we start to deal with external dependencies. A good example is one we've just seen: File.ReadLines(). We're depending on an external resource (a file) and we want to keep reading it sequentially until either it's all been consumed or the caller terminates the process. Loading potentially gigabytes into memory before we give any response would be a pretty lousy experience, especially as we might run out of memory before the caller even gets to do anything.
Another possible use case would be one that depends on a network API. For example, say you've got an API that returns a list of search results as JSON objects. Your code wants to parse each of these search results and then put them into a database. At the lowest level, your code needs to create the web request, transform the JSON string into a C# object, and then return that to a caller which parses the data contained therein and puts it into your database. The lowest level of this sounds a lot like a yielded IEnumerable.
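As a rough sketch (SearchResult, FetchPage and the paging scheme are all hypothetical stand-ins for your deserialised JSON type and your HTTP call):
static IEnumerable<SearchResult> Search(string query)
{
    for (var page = 0; ; page++)
    {
        // Hypothetical helper: performs the web request for one page of
        // results and deserialises the JSON into a list of SearchResults.
        IReadOnlyList<SearchResult> results = FetchPage(query, page);
        if (results.Count == 0)
        {
            yield break; // no more pages, terminate the sequence
        }
        foreach (var result in results)
        {
            // The caller can parse this result and insert it into the
            // database before the next page is even requested.
            yield return result;
        }
    }
}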
In summary, IEnumerable and yield return give you C#'s version of a generator: a sequence that evaluates its elements on demand rather than all in advance. They are well worth considering whenever you are dealing with sequences.