We introduce iTeach, a Mixed Reality (MR) framework to improve robot perception through real-time interactive teaching. By allowing human instructors to dynamically label robot RGB data, iTeach improves both the accuracy and adaptability of robot perception to new scenarios. The framework supports on-the-fly data collection and labeling, enhancing model performance, and generalization. Applied to door and handle detection for household tasks, iTeach integrates a HoloLens app with an interactive YOLO model. Furthermore, we introduce the IRVLUTD DoorHandle dataset. DH-YOLO, our efficient detection model, significantly enhances the accuracy and efficiency of door and handle detection, highlighting the potential of MR to make robotic systems more capable and adaptive in real-world environments. The project page is available at https://irvlutd.github.io/iTeach.
Adapting Pre-Trained Vision Models for Novel Instance Detection and Segmentation
Novel Instance Detection and Segmentation (NIDS) aims at detecting and segmenting novel object instances given a few examples of each instance. We propose a unified framework (NIDS-Net) comprising object proposal generation, embedding creation for both instance templates and proposal regions, and embedding matching for instance label assignment. Leveraging recent advancements in large vision methods, we utilize the Grounding DINO and Segment Anything Model (SAM) to obtain object proposals with accurate bounding boxes and masks. Central to our approach is the generation of high-quality instance embeddings. We utilize foreground feature averages of patch embeddings from the DINOv2 ViT backbone, followed by refinement through a weight adapter mechanism that we introduce. We show experimentally that our weight adapter can adjust the embeddings locally within their feature space and effectively limit overfitting. This methodology enables a straightforward matching strategy, resulting in significant performance gains. Our framework surpasses current state-of-the-art methods, demonstrating notable improvements of 22.3, 46.2, 10.3, and 24.0 in average precision (AP) across four detection datasets. In instance segmentation tasks on seven core datasets of the BOP challenge, our method outperforms the top RGB methods by 3.6 AP and remains competitive with the best RGB-D method.
Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning
We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by the unimodal prototypical networks for few-shot learning, we introduce PROTO-CLIP that utilizes image prototypes and text prototypes for few-shot learning. Specifically, PROTO-CLIP adapts the image encoder and text encoder in CLIP in a joint fashion using few-shot examples. The two encoders are used to compute prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of corresponding classes. Such a proposed alignment is beneficial for few-shot classification due to the contributions from both types of prototypes. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning as well as in the real world for robot perception.
SceneReplica: Benchmarking Real-World Robot Manipulation by Creating Replicable Scenes
We present a new reproducible benchmark for evaluating robot manipulation in the real world, specifically focusing on the task of pick-and-place. Our benchmark uses the YCB objects, a commonly used dataset in the robotics community, to ensure that our results are comparable to other studies. Additionally, the benchmark is designed to be easily reproducible in the real world, making it accessible for researchers and practitioners. We also provide our experimental results and analyses for model-based and model-free 6D robotic grasping on the benchmark, where representative algorithms for object perception, grasping planning and motion planning are evaluated. We believe that our benchmark will be a valuable tool for advancing the field of robot manipulation. By providing a standardized evaluation framework, researchers can more easily compare different techniques and algorithms, leading to faster progress in developing robot manipulation methods.
2023
2023
FewSOL: A Dataset for Few-Shot Object Learning in Robotic Environments
We introduce the Few-Shot Object Learning (FEWSOL) dataset for object recognition with a few images per object. We captured 336 real-world objects with 9 RGB-D images per object from different views. Fewsol has object segmentation masks, poses, and attributes. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. We investigated (i) few-shot object classification and (ii) joint object segmentation and few-shot classification with state-of-the-art methods for few-shot learning and meta-learning using our dataset. The evaluation results show the presence of a large margin to be improved for few-shot object classification in robotic environments, and our dataset can be used to study and enhance few-shot object recognition for robot perception. Dataset and code available at https://irvlutd.github.io/FewSOL.
A sequential approach for noninferiority or equivalence of a linear contrast under cost constraints.
Planning an appropriate sample size for a study involves considering several issues. Two important considerations are cost constraints and variability inherent in the population from which data will be sampled. Methodologists have developed sample size planning methods for two or more populations when testing for equivalence or noninferiority/superiority for a linear contrast of population means. Additionally, cost constraints and variance heterogeneity among populations have also been considered. We extend these methods by developing a theory for sequential procedures for testing the equivalence or noninferiority/superiority for a linear contrast of population means under cost constraints, which we prove to effectively utilize the allocated resources. Our method, due to the sequential framework, does not require prespecified values of unknown population variance(s), something that is historically an impediment to designing studies. Importantly, our method does not require an assumption of a specific type of distribution of the data in the relevant population from which the observations are sampled, as we make our developments in a data distribution-free context. We provide an illustrative example to show how the implementation of the proposed approach can be useful in applied research.
2021
2021
[Re] Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings
Our work consists of four parts: (1) Reproducing results from Saxena et al. [2020] (2) Adding more experiments by replacing the knowledge graph embedding method (3) and exploring the question embedding method using various transformer models (4) Verifying the importance of Relation Matching (RM) module. Based on the code shared by the authors, we have reproduced the results for the EmbedKGQA method. We have not purposely performed relation matching to validate point-4.
Edge-Detect: Edge-Centric Network Intrusion Detection using Deep Neural Network
Edge nodes are crucial for detection against multitudes of cyber attacks on Internet-of-Things endpoints and is set to become part of a multi-billion industry. The resource constraints in this novel network infrastructure tier constricts the deployment of existing Network Intrusion Detection System with Deep Learning models (DLM). We address this issue by developing a novel light, fast and accurate ‘Edge-Detect’ model, which detects Distributed Denial of Service attack on edge nodes using DLM techniques. Our model can work within resource restrictions i.e. low power, memory and processing capabilities, to produce accurate results at a meaningful pace. It is built by creating layers of Long Short-Term Memory or Gated Recurrent Unit based cells, which are known for their excellent representation of sequential data. We designed a practical data science pipeline with Recurring Neural Network to learn from the network packet behavior in order to identify whether it is normal or attack-oriented. The model evaluation is from deployment on actual edge node represented by Raspberry Pi using current cybersecurity dataset (UNSW2015). Our results demonstrate that in comparison to conventional DLM techniques, our model maintains a high testing accuracy of 99% even with lower resource utilization in terms of cpu and memory. In addition, it is nearly 3 times smaller in size than the state-of-art model and yet requires a much lower testing time.
2018
2018
MV-Tractus: A simple tool to extract motion vectors from H264 encoded video sources