Vectorized assembly code:
V1 <- A
V2 <- B
V3 <- V1+V2
A <- V3
4n clock cycles for vectors of length n: four vector instructions at n cycles each, with no loop iteration overhead (ignoring speedup from pipelining)
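
The 4n figure can be checked with a small cycle-counting model. This is only a sketch under stated assumptions: each vector instruction is taken to cost one cycle per element (n cycles in total, no pipelining), the final store is assumed to write the sum V3 back to A, and the function name `vector_add_in_place` is hypothetical.

```python
def vector_add_in_place(a, b):
    """Simulate the four vector instructions and return the cycle count.

    Assumption: every vector instruction handles one element per cycle,
    so it costs n cycles for vectors of length n (no pipelining).
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("vectors must have the same length")

    cycles = 0

    v1 = list(a)                          # V1 <- A        (vector load)
    cycles += n

    v2 = list(b)                          # V2 <- B        (vector load)
    cycles += n

    v3 = [x + y for x, y in zip(v1, v2)]  # V3 <- V1 + V2  (vector add)
    cycles += n

    a[:] = v3                             # A <- V3        (vector store, assumed)
    cycles += n

    # No per-iteration increment, compare, or branch is counted,
    # because the loop is implicit in each vector instruction.
    return cycles
```

For example, with `a = [1, 2, 3]` and `b = [10, 20, 30]`, the call returns 12 (that is, 4n with n = 3) and leaves `a` equal to `[11, 22, 33]`.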