GroupViT is a framework for learning semantic segmentation purely from text captions without using any mask supervision. It learns to perform bottom-up hierarchical spatial grouping of ...