What's the recommended way to split data in PyTensor?

February 27, 2024, 8:04pm 1

Hi,

I have some data in the form of a matrix, with missing values coded as -999 in the first column. I have one function that calculates the log-likelihood of all rows without missing data, and another that calculates the log-likelihood of all rows with missing data. After the calculation, I want to combine the results into one vector as the output of a likelihood function. It seems that you can’t use something like data[data[:, 0] == -999, :], which you would usually do in numpy, to subset the data in PyTensor. What is the recommended way to do this in PyTensor?

Thanks!

cluhmann February 27, 2024, 11:44pm 2

Can you provide a bit more context for your indexing operation? PyTensor has all the standard numpy operations (e.g., where()). Something should work.

Thank you @cluhmann and @ricardoV94!

So I thought about using pt.where(), but that computes both branches on the full dataset, which doesn’t seem to be the most efficient way to do this.
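For concreteness, a minimal sketch of that pt.where() approach (assuming func1 and func2 can be evaluated on the full matrix):

import pytensor.tensor as pt

# Hypothetical sketch: both branches are computed for every row;
# pt.where only selects between the two results per row.
mask = pt.eq(data[:, 0], -999)
result = pt.where(mask, func1(data), func2(data))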

So what I want to do, if I had numpy, is something like this:

def logp(data, ...):
    # Rows whose first column is the missing-value code -999
    split1 = data[data[:, 0] == -999, :]
    # Rows with no missing data
    split2 = data[data[:, 0] != -999, :]

    result1 = func1(split1, ...)
    result2 = func2(split2, ...)

    # Scatter the per-split results back into a single vector
    result = np.zeros(data.shape[0])
    result[data[:, 0] == -999] = result1
    result[data[:, 0] != -999] = result2
    return result

It seems that index assignment is not supported, so I will have to use pt.set_subtensor(). However, I can’t use boolean indexing in set_subtensor. How do I get around this?

You have to use pt.eq and pt.neq instead of == and !=. That’s one of the annoying things about working with PyTensor variables; it has to do with Python’s constraints on equality, inequality, and hashing.
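For example, == on a PyTensor variable falls back to plain Python object equality and returns a bool instead of building a graph node:

import pytensor.tensor as pt

x = pt.vector("x")

bad = x == -999        # plain Python equality: a bool, not a graph node
good = pt.eq(x, -999)  # symbolic elementwise comparison usable for indexing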

So this

result = np.zeros(data.shape[0])
result[data[:, 0] == -999] = result1
result[data[:, 0] != -999] = result2

Can be rewritten as

result = pt.zeros(data.shape[0])
result = pt.set_subtensor(result[pt.eq(data[:, 0], -999)], result1, inplace=True)
result = pt.set_subtensor(result[pt.neq(data[:, 0], -999)], result2, inplace=True)

Correct? I know that the dimensions of result1 and result2 will always be correct, but does PyTensor know about this when compiling the Op?

You don’t need to use the inplace flag; PyTensor will add inplace Ops itself. You can initialize the tensor with pt.empty or pt.empty_like instead.

Do note that such optimizations may not result in a faster graph. Sometimes indexing is actually slower, as it breaks loop fusion and memory layouts (and indexing itself can be slow). If the graphs of func1/func2 are Elemwise, the compiler (after PyTensor) may even avoid the useless branch without you knowing it.
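Putting it together, a minimal end-to-end sketch of the pattern discussed above (the toy func1/func2 here are hypothetical stand-ins for the real log-likelihood functions; .nonzero() converts the boolean masks into the integer indices that advanced indexing accepts):

import numpy as np
import pytensor
import pytensor.tensor as pt

data = pt.matrix("data")

# Integer row indices for each split
missing_idx = pt.eq(data[:, 0], -999).nonzero()
present_idx = pt.neq(data[:, 0], -999).nonzero()

# Hypothetical stand-ins for the real log-likelihood functions
def func1(rows):
    return rows[:, 1:].sum(axis=1)  # rows with missing first column

def func2(rows):
    return rows.sum(axis=1)         # fully observed rows

# pt.empty avoids initializing memory that is fully overwritten anyway
result = pt.empty((data.shape[0],), dtype=data.dtype)
result = pt.set_subtensor(result[missing_idx], func1(data[missing_idx]))
result = pt.set_subtensor(result[present_idx], func2(data[present_idx]))

f = pytensor.function([data], result)
print(f(np.array([[-999.0, 1.0], [2.0, 3.0]])))  # [1. 5.]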

This worked really well. Thank you so much, @ricardoV94 and @cluhmann!