Publications (Google Scholar Profile)

GRIN: Generative Relation and Intention Network for Multi-agent Trajectory Prediction

Published in Conference on Neural Information Processing Systems (NeurIPS), 2021

Abstract

Learning the distribution of future trajectories conditioned on the past is a crucial problem for understanding multi-agent systems. This is challenging because humans make decisions based on complex social relations and personal intents, resulting in highly complex uncertainties over trajectories. To address this problem, we propose a conditional deep generative model that combines advances in graph neural networks. The prior and recognition models encode two types of latent codes for each agent: an inter-agent latent code to represent social relations and an intra-agent latent code to represent agent intentions. The decoder is carefully devised to leverage the codes in a disentangled way to predict a multi-modal future trajectory distribution. Specifically, a graph attention network built upon the inter-agent latent code is used to learn continuous pair-wise relations, and an agent’s motion is controlled by its latent intents and its observations of all other agents. Through experiments on both synthetic and real-world datasets, we show that our model outperforms previous work on multiple performance metrics. We also show that our model generates realistic multi-modal trajectories.

Recommended citation: L. Li, J. Yao, T. He, T. Xiao, W. Li, J. Yan, D. Wipf, Z. Zhang. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems. NeurIPS 2021

Learning Hierarchical Graph Neural Networks for Image Clustering

Published in IEEE/CVF International Conference on Computer Vision (ICCV), 2021

Abstract

We propose a hierarchical graph neural network (GNN) model that learns how to cluster a set of images into an unknown number of identities using a training set of images annotated with labels belonging to a disjoint set of identities. Our hierarchical GNN uses a novel approach to merge connected components predicted at each level of the hierarchy to form a new graph at the next level. Unlike fully unsupervised hierarchical clustering, the choice of grouping and complexity criteria stems naturally from supervision in the training set. The resulting method, Hi-LANDER, achieves an average of 54% improvement in F-score and an 8% increase in Normalized Mutual Information (NMI) relative to current GNN-based clustering algorithms. Additionally, state-of-the-art GNN-based methods rely on separate models to predict linkage probabilities and node densities as intermediate steps of the clustering process. In contrast, our unified framework achieves a seven-fold decrease in computational cost. We release our training and inference code here.
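The merging step described in this abstract can be illustrated with a minimal sketch. It is not the Hi-LANDER implementation: the edges are given by hand rather than predicted by a GNN, and averaging member features to build each next-level node is an assumption made purely for illustration. The function names `connected_components` and `merge_into_super_nodes` are hypothetical.

```python
def connected_components(num_nodes, edges):
    """Label each node with the index of its connected component (union-find)."""
    parent = list(range(num_nodes))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Renumber roots as 0, 1, 2, ... in order of first appearance.
    root_to_label = {}
    return [root_to_label.setdefault(find(i), len(root_to_label))
            for i in range(num_nodes)]


def merge_into_super_nodes(features, labels):
    """Build one next-level node per component by averaging member features.

    Averaging is an illustrative assumption, not the paper's aggregation rule.
    """
    num_groups = max(labels) + 1
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for feat, lab in zip(features, labels):
        counts[lab] += 1
        for d in range(dim):
            sums[lab][d] += feat[d]
    return [[s / counts[k] for s in sums[k]] for k in range(num_groups)]


# Six nodes with 1-D features; the edges stand in for predicted linkages.
features = [[0.0], [1.0], [2.0], [10.0], [12.0], [50.0]]
edges = [(0, 1), (1, 2), (3, 4)]

labels = connected_components(len(features), edges)
super_nodes = merge_into_super_nodes(features, labels)
```

Repeating the two calls on `super_nodes`, with a new set of edges, would produce the next level of the hierarchy.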

Recommended citation: Y Xing, T He, T Xiao, Y Wang, Y Xiong, W Xia, D Wipf, Z Zhang, S Soatto. In Proceedings of the IEEE/CVF International Conference on Computer Vision. ICCV 2021

Visualizing and comparing AlexNet and VGG using deconvolutional layers

Published in ICML 2016 Workshop on Visualization for Deep Learning, 2016

Abstract

Convolutional Neural Networks (CNNs) have kept improving performance on ImageNet classification since they were first successfully applied to the task in 2012. To achieve better performance, the complexity of CNNs is continually increasing with deeper and bigger architectures. Though CNNs have achieved promising external classification behavior, understanding of their internal working mechanism is still limited. In this work, we attempt to understand the internal working mechanism of CNNs by probing the internal representations in two comprehensive aspects, i.e., visualizing patches in the representation spaces constructed by different layers, and visualizing the visual information kept in each layer. We further compare CNNs with different depths and show the advantages brought by deeper architectures.

Recommended citation: W Yu, K Yang, Y Bai, T Xiao, H Yao, Y Rui. In ICML 2016 Workshop on Visualization for Deep Learning. ICML 2016 Workshop

The linear representation of CNN for single image

Published in ICML 2016 Workshop on Visualization for Deep Learning, 2016

Abstract

CNNs can model the complex underlying mappings between images and categories through several layers with non-linear activation functions. However, it is hard to analyze the non-linear relation learned by a CNN. In this paper, we show that a set of well-performing CNNs (composed of convolutional layers, max-pooling layers and ReLU) are piecewise linear, i.e., linear at every single image. This property means that the output/score of a neuron is a linear combination of the outputs of any lower layer for a given image. With this property, we can distribute the score of a neuron to every position of a lower layer to probe which positions contribute most to the score of the neuron.
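The piecewise-linear property can be checked on a toy example. The sketch below is not the paper's code: it uses a tiny two-layer fully connected ReLU network with hand-picked, hypothetical weights (no convolution or max-pooling) and shows that, once the ReLU activation pattern for one input is fixed, the output equals a plain affine map `A x + c` of that input.

```python
# Hypothetical weights for a 2-input, 3-hidden-unit, 1-output ReLU network.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 1.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 2.0, -1.0]]
b2 = [0.25]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(x):
    """Return the network output and the hidden pre-activations."""
    pre = [p + b for p, b in zip(matvec(W1, x), b1)]
    hidden = [max(0.0, p) for p in pre]
    out = [o + b for o, b in zip(matvec(W2, hidden), b2)]
    return out, pre


def local_linear_map(x):
    """Return (A, c) with forward(x)[0] == A x + c for this input's ReLU pattern."""
    _, pre = forward(x)
    mask = [1.0 if p > 0 else 0.0 for p in pre]  # which ReLUs are active
    hidden_units = range(len(mask))
    # A = W2 * diag(mask) * W1
    A = [[sum(W2[o][h] * mask[h] * W1[h][i] for h in hidden_units)
          for i in range(len(x))]
         for o in range(len(W2))]
    # c = W2 * diag(mask) * b1 + b2
    c = [sum(W2[o][h] * mask[h] * b1[h] for h in hidden_units) + b2[o]
         for o in range(len(W2))]
    return A, c


x = [2.0, 0.5]
out, _ = forward(x)
A, c = local_linear_map(x)
linear_out = [a + ci for a, ci in zip(matvec(A, x), c)]

# Per-input contributions to the output score; together with c they sum to it.
contributions = [A[0][i] * x[i] for i in range(len(x))]
```

Here `contributions` plays the role of distributing a neuron's score over the positions of a lower layer, in this case the input itself.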

Recommended citation: W Yu, K Yang, Y Bai, T Xiao, H Yao, Y Rui. In ICML 2016 Workshop on Visualization for Deep Learning. ICML 2016 Workshop

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

Published in NeurIPS 2015 Workshop on Machine Learning Systems, 2015

Abstract

MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters.

Recommended citation: T Chen, M Li, Y Li, M Lin, N Wang, M Wang, T Xiao, B Xu, C Zhang, and Z Zhang. In NeurIPS 2015 Workshop on Machine Learning Systems. NeurIPS 2015 Workshop

The Application of Two Level Attention Models in Deep Convolutional Neural Network for Fine-grained Image Classification

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

Abstract

Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding the foreground object or object parts (where) to extract discriminative features (what).

Recommended citation: T Xiao, Y Xu, K Yang, J Zhang, Y Peng and Z Zhang. In Proceedings of the IEEE conference on computer vision and pattern recognition. CVPR 2015

Scale-invariant convolutional neural networks

Published in arXiv, 2014

Abstract

Even though convolutional neural networks (CNNs) have achieved near-human performance in various computer vision tasks, their ability to tolerate scale variations is limited. The popular practice is to make the model bigger first and then train it with data augmentation using extensive scale-jittering. In this paper, we propose a scale-invariant convolutional neural network (SiCNN), a model designed to incorporate multi-scale feature extraction and classification into the network structure. SiCNN uses a multi-column architecture, with each column focusing on a particular scale. Unlike previous multi-column strategies, these columns share the same set of filter parameters by a scale transformation among them. This design deals with scale variation without blowing up the model size. Experimental results show that SiCNN detects features at various scales, and the classification result exhibits strong robustness against object scale variations.

Recommended citation: Y Xu, T Xiao, J Zhang, K Yang, Z Zhang. arXiv 2014

Minerva: a scalable and highly efficient deep learning training platform

Published in NeurIPS 2014 Workshop of Distributed Matrix Computations, 2014

Abstract

The tooling landscape of deep learning is fragmented by a growing gap between the generic and productivity-oriented tools that optimize for algorithm development and the task-specific ones that optimize for speed and scale. This creates an artificial barrier to bringing new innovations into real-world applications. Minerva addresses this issue with a layered design that provides language flexibility and execution efficiency simultaneously within one coherent framework. It proposes a matrix-based API, resulting in compact code and a Matlab-like, imperative and procedural coding style. The code is dynamically translated into an internal dataflow representation, which is then efficiently executed against different hardware. The same user code runs on modern laptops and workstations, high-end multi-core servers, or server clusters, with and without GPU acceleration, delivering performance and scalability better than or competitive with existing tools on different platforms.

Recommended citation: M Wang, T Xiao, J Li, J Zhang, C Hong, Z Zhang. In NeurIPS 2014 workshop of Distributed Matrix Computations. NeurIPS 2014 Workshop

Bag-of-Words Based Deep Neural Network for Image Retrieval

Published in Bing Grand Challenge of the 22nd ACM International Conference on Multimedia (ACM MM), 2014

Abstract

This work targets the image retrieval task held by the MSR-Bing Grand Challenge. Image retrieval is considered a challenging task because of the gap between low-level image representation and high-level textual query representation. Recent advances in deep neural networks shed light on narrowing the gap by learning high-level image representations from raw pixels. In this paper, we propose a bag-of-words based deep neural network for the image retrieval task, which learns a high-level image representation and maps images into bag-of-words space. The DNN model is trained on large-scale clickthrough data, and the relevance between query and image is measured by the cosine similarity of the query’s bag-of-words representation and the image’s bag-of-words representation predicted by the DNN. The visual similarity of images is also computed from the high-level image representation extracted by the DNN model. Finally, the PageRank algorithm is used to further improve the ranking list by considering the visual similarity of images for each query. The experimental results achieve state-of-the-art performance and verify the effectiveness of our proposed method.
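The relevance measure named in this abstract, cosine similarity between a query's bag-of-words vector and an image's predicted bag-of-words vector, can be sketched as follows. The vocabulary, the image vectors (standing in for DNN predictions), and the helper names are all hypothetical, and the PageRank re-ranking step is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical three-word vocabulary.
vocab = ["red", "car", "dog"]


def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


# Stand-ins for bag-of-words vectors that a trained model would predict per image.
image_vecs = {
    "img_a": [0.9, 0.8, 0.1],
    "img_b": [0.0, 0.1, 0.95],
}

query_vec = bag_of_words("red car")
scores = {name: cosine_similarity(query_vec, vec)
          for name, vec in image_vecs.items()}

# Rank images by decreasing relevance to the query.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because cosine similarity ignores vector length, an image is ranked by how well the proportions of its predicted words match the query, not by how confident the predictions are overall.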

Recommended citation: Y Bai, W Yu, T Xiao, C Xu, K Yang, WY Ma, T Zhao. Bing Grand Challenge of the 22nd ACM International Conference on Multimedia. ACM MM 2014

Error-Driven Incremental Learning in Deep Convolutional Neural Network for Large-Scale Image Classification

Published in the 22nd ACM International Conference on Multimedia (ACM MM), 2014

Abstract

Supervised learning using deep convolutional neural networks has shown promise in large-scale image classification tasks. As a building block, it is now well positioned to be part of a larger system that tackles real-life multimedia tasks. An unresolved issue is that such a model is trained on a static snapshot of data, whereas in real application scenarios data is collected gradually. Instead, this paper positions the training as a continuous learning process as new classes of data arrive. A system with such capability is useful in practical scenarios, as it gradually expands its capacity to predict an increasing number of new classes. It is also our attempt to address the more fundamental issue: a good learning system must deal with new knowledge that it is exposed to, much as humans do.

Recommended citation: T Xiao, J Zhang, K Yang, Y Peng, Z Zhang. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM MM 2014

PKUICST at TRECVID 2012: Known-item Search Task

Published in TRECVID, 2012

Abstract

We participate in both types of known-item search task of TRECVID 2012: automatic search and interactive search. This paper presents our approaches and results. We adopt three kinds of text information: XML documents, ASR and OCR. We index and search the three kinds of pre-processed text individually with Lucene. In addition, the results are combined and re-ranked by two re-ranking approaches. We achieve good performance, and the official evaluation shows that our team is ranked 1st in both automatic search and interactive search.

Recommended citation: Y Peng, Y Peng, X Zhai, J Zhang, T Xiao, X Huang, and K Cai. TRECVID 2012

PKUICST at TRECVID 2012: Instance Search Task

Published in TRECVID, 2012

Abstract

We participate in both types of instance search task in TRECVID 2012: automatic search and interactive search. This paper presents our approaches and results. In this task, we mainly focus on exploring effective feature representation, feature matching, re-ranking and query expansion. In feature representation, we adopt two basic visual features and five keypoint-based BoW features, and combine them to effectively represent the frame image. In feature matching, multi-bag SVM is adopted since it can make full use of the few query examples. Moreover, we run a keypoint matching algorithm on the top-ranked results. It is effective yet efficient since only the top-ranked results are involved. In the re-ranking stage, we observe that the top-ranked videos always contain a few noisy videos. To eliminate such noise, we propose a re-ranking algorithm based on semi-supervised learning to refine the top-ranked results. In query expansion, we automatically crawl extra training images from Flickr according to the names of the query instances. We achieve good results in both tasks. Official evaluations show that our team is ranked 2nd in automatic search and 1st in interactive search.

Recommended citation: Y Peng, X Zhai, J Zhang, C Yao, T Xiao, N Li, and X Luo. TRECVID 2012