Why does Select-Object take so long to keep the -first objects?

There’s recently been a big discussion on the PowerShell MVP mailing list about Select-Object and its -first parameter, which instructs it to only keep the first “x” objects it’s given. The discussion basically goes like this:

1..10000 | select -first 10

Here, a collection of objects numbered 1 to 10,000 is piped to Select-Object, which only keeps 1 through 10. Yet the pipeline keeps running until all 10,000 objects have been sent through. Perhaps a stronger example is:

Get-EventLog Security | Select -first 10

That takes longer to run, so the delay is more noticeable. Here’s why:

We often think of PowerShell cmdlets as producing all their objects and then sending them on to the next cmdlet. In that last example, then, Get-EventLog would get ALL of the events, and then send them on to Select-Object. Get-EventLog has no way of knowing that Select-Object is only going to keep the first 10 of them, and so it wastes a lot of time getting events that we’ll never see. A better route would be to have the original cmdlet (referred to as the “upstream cmdlet”) produce only the objects we want, and Get-EventLog actually provides an option for this:

Get-EventLog Security -newest 10

That way, Get-EventLog knows to quit after 10 events.

However – while that explanation accurately describes the behavior of PowerShell’s pipeline, it’s really an oversimplification. In reality, a PowerShell cmdlet sends objects along the pipeline one at a time, and most will start doing so as soon as they retrieve their first object. In essence, a pipeline like this:

Get-EventLog Security | Select -first 10

has two cmdlets running more or less in parallel. Get-EventLog is producing event objects, and Select-Object is dropping all but the first 10. If that’s the case, you start to wonder why Select-Object doesn’t just tell Get-EventLog to knock it off, already, and save everyone a lot of time. Right?

Unfortunately, PowerShell’s pipeline is unidirectional, meaning objects go downstream only. There’s no way for one cmdlet to communicate “upstream” and give advice or instructions to another cmdlet. That’s actually an artifact of the way PowerShell cmdlets are built, so it’s worth briefly looking at that.

Inside a PowerShell cmdlet are three main blocks of code, which I’ll refer to as Begin, Process, and End. In fact, these are identical in nature to the BEGIN, PROCESS, and END scriptblocks you’d find in a PowerShell filtering function in a script. When a cmdlet appears in the pipeline, the PowerShell engine calls its Begin code first, so that the cmdlet can do any setup it needs to do (initialize variables, for example). The End block gets called when PowerShell has finished running the pipeline, so that the cmdlet can do any “tear down” work that’s required to clean up after itself (such as closing any open connections it might have been using). But the bulk of a cmdlet’s work is done in the Process code. When objects are piped to a cmdlet, the PowerShell engine calls the Process code once for each incoming object, allowing the cmdlet to work with one object at a time.

This model is hugely significant, because it means a cmdlet never knows how many objects it will be working with – it only sees them one at a time!
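
To make the three blocks concrete, here is a minimal sketch of a filtering function. The name Show-PipelineStage is made up for this illustration; the point is only the order in which the blocks run and that the total is not known until End.

```powershell
# Show-PipelineStage is a hypothetical name used only for this illustration.
function Show-PipelineStage {
    begin {
        # Runs once, before the first object arrives.
        $count = 0
        "Begin: setting up"
    }
    process {
        # Runs once per incoming object; $_ is the current object.
        $count++
        "Process: received $_"
    }
    end {
        # Runs once, after the last object; only now is the total known.
        "End: saw $count objects"
    }
}

1..3 | Show-PipelineStage
```

Piping 1..3 through it prints the Begin line, one Process line per number, and then the End line with the final count.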

Let’s take a non-computer example:

Get-PhoneCall | Select -first 10

You’ve been asked to sit in a room from 8am to 9pm and answer a phone when it rings. However, you only have to answer the first 10 calls that come in. When the phone rings the first time, you answer it, and then make a mark on a notepad. For the second call, you add another mark and answer the call. After the tenth call, you know you’re done answering the phone. However, the phone may continue to ring, and each time it does, your brain has to briefly interrupt itself and say, “I don’t need to answer that.” There’s no way, however, to tell the phone to just stop ringing – it’s going to keep ringing, and you’ll just have to ignore it if you’ve answered your quota.

That’s basically how Select-Object -first x works. Even after Select-Object has selected the number of objects you’ve told it, the PowerShell engine keeps calling Select-Object’s Process code and passing in new objects. Select knows it’s done, so it just has to immediately exit, dropping the extra objects – but it has no way of knowing how many more there will be, and it has no way of telling the PowerShell engine to knock it off.
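
The phone-call scenario can be sketched as a function of your own. Select-FirstFew below is a hypothetical stand-in, not the real Select-Object source; it only assumes the Begin/Process/End model described above and shows that the Process block keeps getting called after the quota is met.

```powershell
# Select-FirstFew is a hypothetical stand-in, not the real Select-Object.
function Select-FirstFew {
    param([int]$First = 10)

    begin {
        # The notepad: how many "calls" have come in so far.
        $seen = 0
    }
    process {
        # Every incoming object still lands here, wanted or not.
        $seen++
        # Pass along the first few; silently drop everything after that.
        if ($seen -le $First) { $_ }
    }
    end {
        # Report to the host only, so the function's output stays clean.
        Write-Host "Process block ran $seen times"
    }
}

1..10000 | Select-FirstFew -First 10
```

The function emits 1 through 10, yet the closing message reports 10000 calls: the phone kept ringing the whole time.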

So here’s the moral of the story: in PowerShell 1.0 and 2.0, Select-Object is not an effective way to quickly obtain a small sample of objects. This:

Get-EventLog Security | Select -first 10

runs no faster than this:

Get-EventLog Security
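
If you want to see for yourself how many objects travel down the pipeline, one rough approach is to count them upstream. This sketch assumes PowerShell 3.0 or later, where Select-Object stops the upstream commands by default once -First is satisfied and the -Wait switch turns that optimization off, reproducing the behavior described here. The variable names are just for illustration.

```powershell
# Assumes PowerShell 3.0 or later, where Select-Object has a -Wait switch.
# -Wait lets the upstream commands run to completion, as described in this article.
# ForEach-Object counts each object as it passes through on its way downstream.
$produced = 0
$kept = 1..10000 |
    ForEach-Object { $produced++; $_ } |
    Select-Object -First 10 -Wait

"Kept $($kept.Count) objects; upstream produced $produced"
```

With -Wait in place the upstream count should come out at the full 10,000, even though only 10 objects are kept.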

So why is Select-Object even useful? Well, because there ARE operations where you need to start with ALL of the objects that can be produced. For example:

Get-Process | Sort VM -descending | Select -first 10

I need ALL of the processes in order for Sort-Object to get them into the right order, but I only want to look at the top 10 consumers of virtual memory. In this case, there’s no way or reason for Get-Process to produce anything less than a full set of objects. I’m not expecting Select-Object to save me time, I simply want it to reduce the amount of data I have to look at, once that data has been placed into a particular order.

If you were thinking that Select-Object would save time by abandoning the pipeline after it had selected the number of objects you specified, you now know that your expectation was inaccurate. The only way to save time is to cut back the number of objects produced by the originating cmdlet, such as:

Get-EventLog Security -newest 10

Of course, not many cmdlets have this capability – ideally, the PowerShell team might one day add a -first parameter to every cmdlet, so that you can easily (and more quickly) get a subset of objects. But until that happens, just understand how Select-Object (and the pipeline in general) works, and you’ll have a more accurate expectation of what it can do for you.
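
For functions you write yourself, you can build that capability in. The sketch below is only an illustration: Get-SampleNumber and its -Total and -First parameters are hypothetical names, and the loop stands in for whatever expensive retrieval a real command would do.

```powershell
# Get-SampleNumber and its parameters are hypothetical, for illustration only.
function Get-SampleNumber {
    param(
        [int]$Total = 10000,
        [int]$First = [int]::MaxValue
    )

    # Stop at the caller's limit instead of producing everything.
    $limit = [Math]::Min($Total, $First)
    for ($i = 1; $i -le $limit; $i++) {
        $i
    }
}

Get-SampleNumber -First 10
```

Because the limit is applied where the objects are produced, nothing downstream has to throw the extras away.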