Abstract
In this paper, we tackle the problem of natural language object retrieval: localizing a target object within an image given a natural language description of it. Unlike text-based image retrieval, natural language object retrieval requires understanding both the spatial configurations of objects in the scene and the global context of the image. To address this, we introduce a novel Context Recurrent ObjNet (CRO) model that serves as a scoring function over candidate bounding boxes, integrating spatial configurations with scene-level contextual information. The model processes the query text, local image features, and spatial configurations through a recurrent network, producing for each candidate box the probability that it matches the query. In addition, the model transfers visual-linguistic knowledge from the image captioning domain to improve retrieval accuracy.
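To make the scoring architecture concrete, the following is a minimal PyTorch sketch of one plausible reading of the abstract: a recurrent network that conditions each query word on concatenated local box features, whole-image context features, and a spatial descriptor, then outputs a matching probability per candidate box. All class names, feature dimensions, and the fusion scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CandidateBoxScorer(nn.Module):
    """Hypothetical sketch of a recurrent box-scoring function.

    Fuses (a) local box features, (b) whole-image context features, and
    (c) a spatial configuration vector, then scores the query text with an
    LSTM. Dimensions and names are assumptions for illustration only.
    """

    def __init__(self, vocab_size, word_dim=300, visual_dim=4096,
                 spatial_dim=8, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # Project the concatenated visual + spatial descriptor to the word-embedding size.
        self.visual_proj = nn.Linear(2 * visual_dim + spatial_dim, word_dim)
        self.lstm = nn.LSTM(2 * word_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, query_tokens, box_feat, image_feat, spatial_feat):
        # query_tokens: (B, T) word indices; box/image/spatial feats: (B, D)
        words = self.embed(query_tokens)                       # (B, T, word_dim)
        visual = self.visual_proj(
            torch.cat([box_feat, image_feat, spatial_feat], dim=1))
        visual = visual.unsqueeze(1).expand(-1, words.size(1), -1)
        # Condition every query word on the candidate box and scene context.
        out, _ = self.lstm(torch.cat([words, visual], dim=2))
        logits = self.score(out[:, -1])                        # state after the last word
        return torch.sigmoid(logits).squeeze(1)                # probability per box


# Toy usage: score 3 candidate boxes against one 5-word query.
if __name__ == "__main__":
    model = CandidateBoxScorer(vocab_size=1000)
    q = torch.randint(0, 1000, (3, 5))
    box, img, sp = torch.randn(3, 4096), torch.randn(3, 4096), torch.randn(3, 8)
    print(model(q, box, img, sp))
```

At retrieval time, such a scorer would be applied to every candidate box of an image and the highest-probability box returned for the query; the transfer of visual-linguistic knowledge mentioned above would correspond to initializing the word embedding and recurrent weights from a pretrained captioning model.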