LIVIVO - The Search Portal for Life Sciences



Search results

Hits 1 - 10 of 142 in total for the query AU="Schmid, Cordelia"

  1. Book ; Online: RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

    Ghosh, Partha / Sanyal, Soubhik / Schmid, Cordelia / Schölkopf, Bernhard

    2024

    Abstract: We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies. To capture these dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a singular latent code to model an entire video sequence. Individual video frames are then synthesized from an intermediate tri-plane representation, which itself is derived from the primary latent code. This novel strategy reduces computational complexity by a factor of 2 as measured in FLOPs. Consequently, our approach facilitates the efficient and temporally coherent generation of videos. Moreover, our joint frame modeling approach, in contrast to autoregressive methods, mitigates the generation of visual artifacts. We further enhance the model's capabilities by integrating an optical flow-based module within our Generative Adversarial Network (GAN) based generator architecture, thereby compensating for the constraints imposed by a smaller generator size. As a result, our model is capable of synthesizing high-fidelity video clips at a resolution of 256×256 pixels, with durations extending to more than 5 seconds at a frame rate of 30 fps. The efficacy and versatility of our approach are empirically validated through qualitative and quantitative assessments across three different datasets comprising both synthetic and real video clips.
    Keywords: Computer Science - Computer Vision and Pattern Recognition ; Computer Science - Machine Learning
    Subject/Category (code): 004
    Publication date: 2024-01-11
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

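    The record above describes synthesizing frames by sampling a hybrid tri-plane representation derived from a single latent code. The following is a minimal, hypothetical PyTorch sketch of a tri-plane lookup for video; the plane resolution, channel count, helper names (sample_plane, decode_pixels), and the toy MLP decoder are illustrative assumptions, not the authors' implementation.

      # Minimal sketch of a tri-plane lookup for video: a pixel at (x, y, t) is
      # decoded from features bilinearly sampled on the xy-, xt-, and yt-planes.
      import torch
      import torch.nn.functional as F

      C, R = 32, 64                               # feature channels, plane resolution (illustrative)
      planes = {k: torch.randn(1, C, R, R) for k in ("xy", "xt", "yt")}

      def sample_plane(plane, u, v):
          """Bilinearly sample plane features at normalized coords u, v in [-1, 1]."""
          grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)   # (1, N, 1, 2)
          feat = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
          return feat.squeeze(-1).squeeze(0).t()                 # (N, C)

      def decode_pixels(x, y, t, mlp):
          """Aggregate the three plane features and decode them to RGB."""
          f = (sample_plane(planes["xy"], x, y)
               + sample_plane(planes["xt"], x, t)
               + sample_plane(planes["yt"], y, t))
          return mlp(f)                                          # (N, 3)

      mlp = torch.nn.Sequential(torch.nn.Linear(C, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
      x = torch.rand(1024) * 2 - 1
      y = torch.rand(1024) * 2 - 1
      t = torch.rand(1024) * 2 - 1
      rgb = decode_pixels(x, y, t, mlp)
      print(rgb.shape)                                           # torch.Size([1024, 3])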

  2. Book ; Online: gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

    Chen, Zerui / Chen, Shizhe / Schmid, Cordelia / Laptev, Ivan

    2023

    Abstract: Signed distance functions (SDFs) are an attractive framework that has recently shown promising results for 3D shape reconstruction from images. SDFs seamlessly generalize to different shape resolutions and topologies but lack explicit modelling of the underlying 3D geometry. In this work, we exploit the hand structure and use it as guidance for SDF-based shape reconstruction. In particular, we address reconstruction of hands and manipulated objects from monocular RGB images. To this end, we estimate poses of hands and objects and use them to guide 3D reconstruction. More specifically, we predict kinematic chains of pose transformations and align SDFs with highly-articulated hand poses. We improve the visual features of 3D points with geometry alignment and further leverage temporal information to enhance the robustness to occlusion and motion blurs. We conduct extensive experiments on the challenging ObMan and DexYCB benchmarks and demonstrate significant improvements of the proposed method over the state of the art.

    Comment: Accepted by CVPR 2023. Project Page: https://zerchen.github.io/projects/gsdf.html
    Keywords: Computer Science - Computer Vision and Pattern Recognition
    Subject/Category (code): 004
    Publication date: 2023-04-24
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

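    The abstract above hinges on aligning SDF queries with estimated hand poses. Below is a minimal, hypothetical PyTorch sketch of the general idea: query points are expressed relative to estimated hand joints before a small MLP predicts a signed distance. The joint count, feature sizes, and the MLP are illustrative stand-ins for the paper's kinematic-chain alignment.

      # Sketch of pose-guided SDF queries (illustrative shapes and modules).
      import torch
      import torch.nn as nn

      J = 16                                         # number of hand joints (illustrative)
      sdf_mlp = nn.Sequential(nn.Linear(3 * J, 128), nn.ReLU(), nn.Linear(128, 1))

      def query_sdf(points, joints):
          """points: (N, 3) query locations; joints: (J, 3) estimated joint positions."""
          rel = points[:, None, :] - joints[None, :, :]   # (N, J, 3) pose-aligned offsets
          return sdf_mlp(rel.flatten(1))                  # (N, 1) signed distances

      points = torch.rand(2048, 3)
      joints = torch.rand(J, 3)
      print(query_sdf(points, joints).shape)              # torch.Size([2048, 1])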

  3. Book ; Online: Retrieval-Enhanced Contrastive Vision-Text Models

    Iscen, Ahmet / Caron, Mathilde / Fathi, Alireza / Schmid, Cordelia

    2023  

    Abstract: Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark.
    Keywords: Computer Science - Computer Vision and Pattern Recognition
    Subject/Category (code): 006
    Publication date: 2023-06-12
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

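    The abstract above describes refining a frozen CLIP embedding with retrieved cross-modal items using a single-layer fusion transformer. Below is a minimal PyTorch sketch of that retrieve-and-fuse pattern, assuming a precomputed memory of normalized embeddings; the memory size, neighbour count, and the refine helper are illustrative assumptions, not the RECO implementation.

      # Retrieve nearest memory items for a query embedding, then fuse with one transformer layer.
      import torch
      import torch.nn.functional as F
      from torch import nn

      D, K = 512, 8                                    # embedding dim, neighbours retrieved (illustrative)
      memory = F.normalize(torch.randn(10_000, D), dim=-1)   # stand-in for an external memory
      fusion = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)

      def refine(query):
          """query: (B, D) frozen image embeddings; returns retrieval-refined embeddings."""
          q = F.normalize(query, dim=-1)
          sims = q @ memory.t()                        # (B, M) cosine similarities
          idx = sims.topk(K, dim=-1).indices           # indices of the K nearest memory items
          neighbours = memory[idx]                     # (B, K, D)
          tokens = torch.cat([q.unsqueeze(1), neighbours], dim=1)   # (B, 1+K, D)
          return fusion(tokens)[:, 0]                  # refined query token

      print(refine(torch.randn(4, D)).shape)           # torch.Size([4, 512])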

  4. Book ; Online: Learning Video-Conditioned Policies for Unseen Manipulation Tasks

    Chane-Sane, Elliot / Schmid, Cordelia / Laptev, Ivan

    2023  

    Abstract: The ability to specify robot commands by a non-expert user is critical for building generalist agents capable of solving a large variety of tasks. One convenient way to specify the intended robot goal is by a video of a person demonstrating the target task. While prior work typically aims to imitate human demonstrations performed in robot environments, here we focus on a more realistic and challenging setup with demonstrations recorded in natural and diverse human environments. We propose Video-conditioned Policy learning (ViP), a data-driven approach that maps human demonstrations of previously unseen tasks to robot manipulation skills. To this end, we learn our policy to generate appropriate actions given current scene observations and a video of the target task. To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos. Both robot and human videos in our framework are represented by video embeddings pre-trained for human action recognition. At test time we first translate human videos to robot videos in the common video embedding space, and then use resulting embeddings to condition our policies. Notably, our approach enables robot control by human demonstrations in a zero-shot manner, i.e., without using robot trajectories paired with human instructions during training. We validate our approach on a set of challenging multi-task robot manipulation environments and outperform state of the art. Our method also demonstrates excellent performance in a new challenging zero-shot setup where no paired data is used during training.

    Comment: ICRA 2023. See the project webpage at https://www.di.ens.fr/willow/research/vip/
    Keywords: Computer Science - Robotics ; Computer Science - Computer Vision and Pattern Recognition ; Computer Science - Machine Learning
    Subject/Category (code): 629 ; 004
    Publication date: 2023-05-10
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

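    The abstract above conditions a policy on an embedding of the demonstration video after translating it into the robot-video embedding space. The following is a minimal, hypothetical PyTorch sketch of that conditioning; the dimensions, the linear translation module, and the MLP policy are stand-ins for the learned components described in the paper.

      # Condition an action policy on the current observation plus a demo-video embedding.
      import torch
      from torch import nn

      D_VID, D_OBS, D_ACT = 512, 64, 7                  # illustrative dimensions
      translate = nn.Linear(D_VID, D_VID)               # human-video -> robot-video embedding space
      policy = nn.Sequential(nn.Linear(D_OBS + D_VID, 256), nn.ReLU(), nn.Linear(256, D_ACT))

      def act(observation, human_video_embedding):
          """Predict an action from the observation and the translated goal-video embedding."""
          goal = translate(human_video_embedding)       # move into the robot-video space
          return policy(torch.cat([observation, goal], dim=-1))

      obs = torch.randn(1, D_OBS)
      demo = torch.randn(1, D_VID)                      # pre-trained video embedding of the demo
      print(act(obs, demo).shape)                       # torch.Size([1, 7])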

  5. Book ; Online: Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

    Iscen, Ahmet / Fathi, Alireza / Schmid, Cordelia

    2023  

    Abstract: Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems. The goal is to enhance the recognition capabilities of the model by retrieving similar examples for the visual input from an external memory set. In this work, we introduce an attention-based memory module, which learns the importance of each retrieved example from the memory. Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query. We also thoroughly study various ways of constructing the memory dataset. Our experiments show the benefit of using a massive-scale memory dataset of 1B image-text pairs, and demonstrate the performance of different memory representations. We evaluate our method in three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.

    Comment: Accepted to CVPR 2023
    Keywords: Computer Science - Computer Vision and Pattern Recognition ; Computer Science - Machine Learning
    Subject/Category (code): 006
    Publication date: 2023-04-11
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

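    The abstract above introduces an attention-based memory module that weights retrieved examples by their relevance to the query. Below is a minimal PyTorch sketch of such a module using standard multi-head attention; the embedding size, neighbour count, and aggregate helper are illustrative assumptions rather than the paper's exact module.

      # Weight retrieved neighbours by attention so irrelevant examples contribute little.
      import torch
      from torch import nn

      D, K = 256, 16                                    # illustrative embedding dim, neighbours
      attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

      def aggregate(query, retrieved):
          """query: (B, D); retrieved: (B, K, D) neighbours from the memory set."""
          out, weights = attn(query.unsqueeze(1), retrieved, retrieved)
          return out.squeeze(1), weights                # (B, D) fused feature, (B, 1, K) weights

      q = torch.randn(2, D)
      mem = torch.randn(2, K, D)
      fused, w = aggregate(q, mem)
      print(fused.shape, w.shape)                       # torch.Size([2, 256]) torch.Size([2, 1, 16])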

  6. Book ; Online: AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

    Seo, Paul Hongsuck / Nagrani, Arsha / Schmid, Cordelia

    2023

    Abstract: Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation. We do this by (i) injecting visual embeddings into a frozen ASR model using lightweight trainable adaptors. We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters. (ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech). Qualitative results show that our model effectively leverages visual information for robust speech recognition.

    Comment: CVPR 2023
    Keywords: Computer Science - Computer Vision and Pattern Recognition ; Computer Science - Sound ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject/Category (code): 006 ; 004
    Publication date: 2023-03-29
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

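    The abstract above injects visual embeddings into a frozen speech model through lightweight trainable adaptors. The following is a minimal, hypothetical PyTorch sketch of that pattern: only a small projection is trained while the speech encoder stays frozen. The encoder, adaptor shape, and token counts are illustrative, not the AVFormer architecture.

      # Prepend projected visual tokens to a frozen speech encoder; only the adapter is trainable.
      import torch
      from torch import nn

      D = 512
      asr_encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 8, batch_first=True), 2)
      for p in asr_encoder.parameters():
          p.requires_grad = False                       # the speech model stays frozen

      visual_adapter = nn.Linear(768, D)                # lightweight trainable projection (illustrative)

      def forward(audio_tokens, visual_features):
          """audio_tokens: (B, Na, D); visual_features: (B, Nv, 768)."""
          vis = visual_adapter(visual_features)         # (B, Nv, D)
          return asr_encoder(torch.cat([vis, audio_tokens], dim=1))

      out = forward(torch.randn(1, 100, D), torch.randn(1, 4, 768))
      print(out.shape)                                  # torch.Size([1, 104, 512])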

  7. Book ; Online: PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

    Chen, Shizhe / Garcia, Ricardo / Schmid, Cordelia / Laptev, Ivan

    2023

    Abstract: The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics. The dominant approaches for language-guided manipulation use 2D image representations, which face difficulties in combining multi-view cameras and inferring precise 3D positions and relationships. To address these limitations, we propose a 3D point cloud based policy called PolarNet for language-guided manipulation. It leverages carefully designed point cloud inputs, efficient point cloud encoders, and multimodal transformers to learn 3D point cloud representations and integrate them with language instructions for action prediction. PolarNet is shown to be effective and data efficient in a variety of experiments conducted on the RLBench benchmark. It outperforms state-of-the-art 2D and 3D approaches in both single-task and multi-task learning. It also achieves promising results on a real robot.

    Comment: Accepted to CoRL 2023. Project website: https://www.di.ens.fr/willow/research/polarnet/
    Keywords: Computer Science - Robotics ; Computer Science - Computer Vision and Pattern Recognition
    Subject/Category (code): 629
    Publication date: 2023-09-27
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

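    The abstract above combines point cloud features with language instructions in a multimodal transformer to predict actions. Below is a minimal, hypothetical PyTorch sketch of that pipeline; the per-point encoder, fusion layer, action head, and all dimensions are illustrative stand-ins for PolarNet's actual components.

      # Fuse per-point features with instruction tokens and predict a pooled action.
      import torch
      from torch import nn

      D = 256
      point_encoder = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, D))  # xyz + rgb per point
      fuse = nn.TransformerEncoderLayer(D, 8, batch_first=True)
      action_head = nn.Linear(D, 8)                     # e.g. gripper pose + open/close (illustrative)

      def predict_action(points, lang_tokens):
          """points: (B, N, 6) coloured point cloud; lang_tokens: (B, L, D) instruction tokens."""
          pts = point_encoder(points)                   # per-point features
          tokens = fuse(torch.cat([pts, lang_tokens], dim=1))
          return action_head(tokens.mean(dim=1))        # pooled prediction

      print(predict_action(torch.randn(1, 512, 6), torch.randn(1, 12, D)).shape)   # torch.Size([1, 8])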

  8. Book ; Online: How can objects help action recognition?

    Zhou, Xingyi / Arnab, Anurag / Sun, Chen / Schmid, Cordelia

    2023  

    Abstract: Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects or their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works which either drop tokens at the cost of accuracy, or increase accuracy whilst also increasing the computation required. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. And second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves better performance when using fewer tokens than strong baselines. In particular, we match our baseline with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-something v2, and Epic-Kitchens, respectively. When we use our model to process the same number of tokens as our baseline, we improve by 0.6 to 4.2 points on these datasets.

    Comment: CVPR 2023
    Keywords: Computer Science - Computer Vision and Pattern Recognition
    Subject/Category (code): 006
    Publication date: 2023-06-20
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

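    The abstract above keeps only a small fraction of input tokens, guided by object locations. Below is a minimal PyTorch sketch of an object-guided token sampling step, assuming patch and object boxes in (x1, y1, x2, y2) form; the overlap scoring and keep ratio are illustrative, not the paper's exact strategy.

      # Keep the tokens whose patches overlap most with any detected object box.
      import torch

      def object_guided_sampling(tokens, patch_boxes, object_boxes, keep_ratio=0.3):
          """tokens: (N, D) patch tokens; patch_boxes: (N, 4); object_boxes: (M, 4)."""
          x1 = torch.maximum(patch_boxes[:, None, 0], object_boxes[None, :, 0])
          y1 = torch.maximum(patch_boxes[:, None, 1], object_boxes[None, :, 1])
          x2 = torch.minimum(patch_boxes[:, None, 2], object_boxes[None, :, 2])
          y2 = torch.minimum(patch_boxes[:, None, 3], object_boxes[None, :, 3])
          overlap = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)   # (N, M) intersection areas
          score = overlap.max(dim=1).values                           # best overlap per token
          keep = score.topk(int(keep_ratio * tokens.shape[0])).indices
          return tokens[keep]

      tokens = torch.randn(196, 768)
      patch_boxes = torch.rand(196, 4).sort(dim=1).values             # dummy, well-ordered boxes
      object_boxes = torch.rand(3, 4).sort(dim=1).values
      print(object_guided_sampling(tokens, patch_boxes, object_boxes).shape)   # torch.Size([58, 768])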

  9. Book ; Online: Dense Video Object Captioning from Disjoint Supervision

    Zhou, Xingyi / Arnab, Anurag / Sun, Chen / Schmid, Cordelia

    2023  

    Abstract: We propose a new task and model for dense video object captioning -- detecting, tracking, and captioning trajectories of all objects in a video. This task unifies spatial and temporal understanding of the video, and requires fine-grained language description. Our model for dense video object captioning is trained end-to-end and consists of different modules for spatial localization, tracking, and captioning. As such, we can train our model with a mixture of disjoint tasks, and leverage diverse, large-scale datasets which supervise different parts of our model. This results in noteworthy zero-shot performance. Moreover, by finetuning a model from this initialization, we can further improve our performance, surpassing strong image-based baselines by a significant margin. Although we are not aware of other work performing this task, we are able to repurpose existing video grounding datasets for our task, namely VidSTG and VLN. We show our task is more general than grounding, and models trained on our task can directly be applied to grounding by finding the bounding box with the maximum likelihood of generating the query sentence. Our model outperforms dedicated, state-of-the-art models for spatial grounding on both VidSTG and VLN.
    Keywords: Computer Science - Computer Vision and Pattern Recognition
    Subject/Category (code): 004
    Publication date: 2023-06-20
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

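    The abstract above notes that the captioning model can be applied to spatial grounding by selecting the object whose trajectory maximizes the likelihood of the query sentence. The following is a minimal, hypothetical PyTorch sketch of that scoring step; the caption_model interface and all shapes are assumptions made for illustration.

      # Pick the tracked object whose features give the query sentence the highest log-likelihood.
      import torch

      def ground_query(trajectory_features, query_token_ids, caption_model):
          """caption_model is assumed to return per-token log-probabilities of shape
          (num_objects, seq_len, vocab); query_token_ids: (seq_len,)."""
          log_probs = caption_model(trajectory_features)                  # hypothetical interface
          idx = query_token_ids.view(1, -1, 1).expand(log_probs.shape[0], -1, 1)
          token_ll = log_probs.gather(-1, idx).squeeze(-1)                # (num_objects, seq_len)
          return token_ll.sum(dim=-1).argmax().item()                     # best-scoring object index

      # toy stand-in for the captioning head
      caption_model = lambda feats: torch.log_softmax(torch.randn(feats.shape[0], 5, 100), dim=-1)
      best = ground_query(torch.randn(7, 256), torch.randint(0, 100, (5,)), caption_model)
      print(best)                                                         # index of the best-matching object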

  10. Book ; Online: Dense Optical Tracking: Connecting the Dots

    Le Moing, Guillaume / Ponce, Jean / Schmid, Cordelia

    2023

    Abstract: Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available in the project webpage: https://16lemoing.github.io/dot .
    Keywords: Computer Science - Computer Vision and Pattern Recognition
    Subject/Category (code): 004 ; 006
    Publication date: 2023-12-01
    Country of publication: us
    Document type: Book ; Online
    Data source: BASE - Bielefeld Academic Search Engine (Life Sciences Selection)

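    The abstract above initializes a dense flow field and visibility mask from sparse tracks via nearest-neighbor interpolation before learned refinement. Below is a minimal PyTorch sketch of that initialization step; the function name and the brute-force nearest-neighbour search are illustrative, and the learned refinement stage is omitted.

      # Nearest-neighbour initialisation of dense flow and visibility from sparse tracks.
      import torch

      def dense_init_from_tracks(src_points, displacements, visibility, height, width):
          """src_points: (N, 2) as (x, y) in the source frame; displacements: (N, 2)
          motion to the target frame; visibility: (N,) in {0, 1}."""
          ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
          grid = torch.stack([xs, ys], dim=-1).float().view(-1, 2)    # (H*W, 2) pixel coords
          nearest = torch.cdist(grid, src_points).argmin(dim=1)       # index of the closest track
          flow = displacements[nearest].view(height, width, 2)
          vis = visibility[nearest].view(height, width)
          return flow, vis                      # rough estimates, to be refined by a learned flow estimator

      flow, vis = dense_init_from_tracks(torch.rand(64, 2) * 32, torch.randn(64, 2),
                                         torch.randint(0, 2, (64,)).float(), 32, 32)
      print(flow.shape, vis.shape)              # torch.Size([32, 32, 2]) torch.Size([32, 32])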
