It's also worth checking the input image resolution, because the vision tokens can dominate memory here. Retrying with a smaller, resized image should confirm whether the OOM is driven by input size. If it still OOMs after ...
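To get a feel for how much resolution matters, here is a rough sketch that picks downscaled dimensions and estimates how many image patches each size would produce. The helper names `fit_within` and `estimate_patches` are made up for illustration, and the patch size of 14 is an assumption, not something taken from the model in question. Real vision encoders often tile, pad, or merge patches differently, so treat the numbers as relative, not exact.

```python
import math


def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most max_side.

    Keeps the aspect ratio; images that already fit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_patches(width, height, patch=14):
    """Rough patch count: ceil(w / patch) * ceil(h / patch).

    patch=14 is an assumed value; check the model's own preprocessing config.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


if __name__ == "__main__":
    original = (4000, 3000)  # hypothetical input resolution
    for max_side in (4000, 2000, 1000, 500):
        w, h = fit_within(*original, max_side)
        print(f"max_side={max_side}: {w}x{h}, ~{estimate_patches(w, h)} patches")
```

Since the patch count grows with image area, halving each side cuts it to roughly a quarter. If the OOM disappears at a smaller `max_side`, that points to input size as the cause.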