cmu-odml.github.io
Practical applications
Natural Language Processing with Small Feed-Forward Networks
Machine Learning at Facebook: Understanding Inference at the Edge
Recognizing People in Photos Through Private On-Device Machine Learning
Knowledge Transfer for Efficient On-device False Trigger Mitigation
Smart Reply: Automated Response Suggestion for Email
Chat Smarter with Allo
Distillation
Model Compression
Distilling the Knowledge in a Neural Network
TinyBERT: Distilling BERT for Natural Language Understanding
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Distilling Large Language Models into Tiny and Effective Students using pQRNN
Sequence-Level Knowledge Distillation
DynaBERT: Dynamic BERT with Adaptive Width and Depth
Does Knowledge Distillation Really Work?
Understanding Knowledge Distillation in Non-autoregressive Machine Translation
Training convolutional neural networks with cheap convolutions and online distillation
Moonshine: Distilling with Cheap Convolutions
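Most of the papers above build on the soft-target recipe from "Distilling the Knowledge in a Neural Network": train the small student to match the teacher's temperature-softened output distribution alongside the usual hard-label loss. A minimal PyTorch sketch (the temperature T and mixing weight alpha are illustrative defaults, not values from any listed paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KD loss (Hinton et al.) plus ordinary cross-entropy.

    student_logits, teacher_logits: [batch, num_classes]; labels: [batch] class ids.
    """
    # Teacher and student distributions softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between softened distributions; the T**2 factor keeps
    # gradient magnitudes comparable as T changes, as in the original paper.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```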
Pruning
Huge list of papers on neural network pruning
Optimal Brain Damage
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
The Lottery Ticket Hypothesis: A Survey (blog post)
Bayesian Bits: Unifying Quantization and Pruning
Structured Pruning of Neural Networks with Budget-Aware Regularization
Block Pruning For Faster Transformers
Structured Pruning Learns Compact and Accurate Models
Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks
Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models
Proving the Lottery Ticket Hypothesis for Convolutional Neural Networks
Learning Pruning-Friendly Networks via Frank-Wolfe: One-Shot, Any-Sparsity, And No Retraining
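Most pruning methods above start from the same primitive: score weights by some saliency (often just magnitude), zero out the lowest-scoring ones, and retrain or fine-tune the survivors. A minimal global magnitude-pruning sketch in PyTorch (the 90% sparsity target and the masking strategy are illustrative only):

```python
import torch

def global_magnitude_prune(model, sparsity=0.9):
    """Zero out the smallest-magnitude weights across the whole model.

    Returns binary masks; in a real setup these would be re-applied after every
    optimizer step, or the surviving weights retrained lottery-ticket style.
    """
    weights = [p for p in model.parameters() if p.dim() > 1]  # skip biases/norms
    all_mags = torch.cat([p.detach().abs().flatten() for p in weights])
    threshold = torch.quantile(all_mags, sparsity)
    masks = []
    with torch.no_grad():
        for p in weights:
            mask = (p.abs() > threshold).float()
            p.mul_(mask)  # unstructured pruning: elementwise zeroing
            masks.append(mask)
    return masks
```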
Neural architecture search
SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers
FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
High-Performance Large-Scale Image Recognition Without Normalization
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
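EfficientNet's compound scaling, listed above, grows network depth, width, and input resolution together under a single coefficient rather than tuning them independently. A quick reproduction of that arithmetic (alpha, beta, gamma are the base coefficients reported in the paper; phi is an arbitrary example):

```python
# Compound scaling: each +1 in phi multiplies FLOPs by roughly
# alpha * beta^2 * gamma^2, which the paper constrains to ~2.
alpha, beta, gamma = 1.2, 1.1, 1.15
phi = 3
depth_mult = alpha ** phi        # more layers
width_mult = beta ** phi         # more channels
resolution_mult = gamma ** phi   # larger input images
flops_mult = (alpha * beta**2 * gamma**2) ** phi
print(depth_mult, width_mult, resolution_mult, flops_mult)  # FLOPs grow ~2**phi
```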
Benchmarking
Show Your Work: Improved Reporting of Experimental Results
Showing Your Work Doesn’t Always Work
The Hardware Lottery
HULK: An Energy Efficiency Benchmark Platform for Responsible Natural Language Processing
An Analysis of Deep Neural Network Models for Practical Applications
MLPerf Inference Benchmark
MLPerf Training Benchmark
Roofline: an insightful visual performance model for multicore architectures
Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs
Deep Learning Language Modeling Workloads: Where Time Goes on Graphics Processors
IrEne: Interpretable Energy Prediction for Transformers
Expected Validation Performance and Estimation of a Random Variable’s Maximum
The Efficiency Misnomer
NeuralPower: Predict and Deploy Energy-Efficient Convolutional Neural Networks
Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression
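Several of these benchmarking papers (Roofline in particular) reduce to simple arithmetic relating compute throughput to memory traffic. A back-of-the-envelope roofline estimate with made-up hardware numbers:

```python
def roofline_gflops(peak_gflops, mem_bw_gb_s, arithmetic_intensity_flop_per_byte):
    """Attainable throughput = min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bw_gb_s * arithmetic_intensity_flop_per_byte)

# Hypothetical accelerator: 100 GFLOP/s peak, 20 GB/s memory bandwidth.
for ai in (0.5, 2, 8, 32):  # FLOPs performed per byte moved
    print(ai, roofline_gflops(100, 20, ai))
# Low-intensity kernels (ai=0.5 -> 10 GFLOP/s) are bandwidth-bound; only past
# the ridge point of 5 FLOP/byte does this device reach its compute roof.
```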
Quantization
Scalable Methods for 8-bit Training of Neural Networks
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
Once-for-All: Train One Network and Specialize it for Efficient Deployment
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
I-BERT: Integer-only BERT Quantization
BinaryBERT: Pushing the Limit of BERT Quantization
TernaryBERT: Distillation-aware Ultra-low Bit BERT
Binarized Neural Networks
Training Deep Neural Networks with 8-bit Floating Point Numbers
HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
Profile-Driven Automated Mixed Precision
Mixed Precision Training
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Differentiable Model Compression via Pseudo Quantization Noise
Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
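The shared starting point for the quantization papers above is mapping float tensors onto a low-bit integer grid. A minimal symmetric per-tensor int8 post-training sketch in NumPy (real pipelines add per-channel scales, activation calibration, and quantization-aware training with a straight-through estimator, as several of the papers above discuss):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization to int8: q = round(w / scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())  # bounded by ~scale/2
```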
Multimodal
VisualBERT: A Simple and Performant Baseline for Vision and Language
Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction
Early Fusion for Goal Directed Robotic Vision
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation
DeeCap: Dynamic Early Exiting for Efficient Image Captioning
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Long-Short Transformer: Efficient Transformers for Language and Vision
Vision / Robotics
Learning Transferable Architectures for Scalable Image Recognition
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net
MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers
ES6D: A Computation Efficient and Symmetry-Aware 6D Pose Regression Framework
Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models
Architecture-specific tricks: CNNs
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
XOR-Net: An Efficient Computation Pipeline for Binary Neural Network Inference on Edge Devices
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
FFT Convolutions are Faster than Winograd on Modern CPUs, Here’s Why
Fast Algorithms for Convolutional Neural Networks
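The MobileNets entry above rests on replacing a standard convolution with a depthwise + pointwise pair; the multiply-accumulate counts below reproduce that comparison (layer sizes are arbitrary examples):

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    # k x k convolution over an h x w feature map
    return h * w * k * k * c_in * c_out

def depthwise_separable_macs(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 conv to mix channels
    return depthwise + pointwise

h, w, k, c_in, c_out = 56, 56, 3, 128, 128
std = standard_conv_macs(h, w, k, c_in, c_out)
sep = depthwise_separable_macs(h, w, k, c_in, c_out)
print(std / sep)  # roughly 1 / (1/c_out + 1/k**2), about 8x fewer MACs here
```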
Architecture-specific tricks: Softmax
Efficient softmax approximation for GPUs
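The adaptive softmax in the paper above cuts the cost of the output projection by keeping frequent words in a small "head" that is always evaluated and relegating rare words to tail clusters evaluated only when needed. A back-of-the-envelope training-time cost comparison (all sizes and frequencies below are made up for illustration):

```python
# Multiply-accumulates per token for the output softmax projection,
# full vocabulary vs. a two-cluster adaptive scheme in the spirit of Grave et al.
d, V = 1024, 250_000           # hidden size, vocabulary size
V_head, p_tail = 20_000, 0.15  # frequent-word shortlist; fraction of tokens in the tail

full = d * V
adaptive = d * (V_head + 1) + p_tail * d * (V - V_head)
print(full / adaptive)  # ~5x fewer MACs per token in this setting
```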
Architecture-specific tricks: Embeddings/inputs
Adaptive Input Representations for Neural Language Modeling
Embedding Recycling for Language Models
Matryoshka Representation Learning
Task-specific tricks
A Study of Non-autoregressive Model for Sequence Generation
Mask-Predict: Parallel Decoding of Conditional Masked Language Models
Non-autoregressive neural machine translation
Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation
Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework
Architecture-specific tricks: Transformers
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
Do Transformer Modifications Transfer Across Implementations and Applications?
Efficient Transformers: A Survey
Consistent Accelerated Inference via Confident Adaptive Transformers
PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
Are Sixteen Heads Really Better Than One?
Are Pre-trained Convolutions Better than Pre-trained Transformers?
FNet: Mixing Tokens with Fourier Transforms
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Universal Transformers
Synthesizer: Rethinking Self-Attention in Transformer Models
Longformer: The Long-Document Transformer
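As one concrete example from the list above, ALiBi ("Train Short, Test Long") replaces positional embeddings with a fixed per-head linear penalty on query-key distance. A small sketch of that bias (assumes a power-of-two head count; causal masking and score scaling are left out):

```python
import torch

def alibi_bias(seq_len, num_heads):
    """Per-head linear attention bias in the spirit of ALiBi (Press et al.):
    score[h, i, j] gets -slope_h * (i - j), with the causal mask applied separately."""
    # Head slopes form the geometric sequence 2^(-8/num_heads), 2^(-16/num_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)  # i - j for j <= i, else 0
    return -slopes[:, None, None] * distance               # [heads, seq, seq]

# Usage sketch: scores = q @ k.transpose(-1, -2) / d**0.5 + alibi_bias(T, H),
# then apply the causal mask and softmax as usual.
```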
Speech
Speech recognition with deep recurrent neural networks
Streaming End-to-end Speech Recognition For Mobile Devices
Accelerating training
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Pre-Training Transformers as Energy-Based Cloze Models
Parameter-Efficient Transfer Learning for NLP
Accelerating Deep Learning by Focusing on the Biggest Losers
Dataset Distillation
Competence-based curriculum learning for neural machine translation
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
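Several entries above (Prefix-Tuning, adapters) speed up adaptation by freezing the pretrained network and training only a small number of new parameters. A minimal bottleneck-adapter sketch in the spirit of "Parameter-Efficient Transfer Learning for NLP" (the bottleneck size, d_model, and the wiring comments are illustrative placeholders):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Only these weights are trained; the pretrained transformer stays frozen."""
    def __init__(self, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Wiring sketch (base_model and num_layers are placeholders):
# for p in base_model.parameters():
#     p.requires_grad = False
# adapters = nn.ModuleList(Adapter(d_model=768) for _ in range(num_layers))
```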
Carbon footprint and alternative power sources
Tackling Climate Change with Machine Learning
Aligning artificial intelligence with climate change mitigation
Energy and Policy Considerations for Deep Learning in NLP
On the Opportunities and Risks of Foundation Models (Section 5.3)
Quantifying the Carbon Emissions of Machine Learning
Measuring the Carbon Intensity of AI in Cloud Instances
Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning
Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models
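The accounting in these papers mostly comes down to the same estimate: measured power draw, scaled by datacenter overhead (PUE), converted to CO2-equivalent with the local grid's carbon intensity. A rough sketch (every constant below is an illustrative placeholder, not a reported figure):

```python
def training_co2e_kg(avg_power_watts, hours, pue=1.5, grid_kgco2_per_kwh=0.4):
    """Rough CO2e estimate in the style of Strubell et al. / Carbontracker:
    energy (kWh) = power * time * PUE; emissions = energy * grid intensity."""
    energy_kwh = avg_power_watts / 1000.0 * hours * pue
    return energy_kwh * grid_kgco2_per_kwh

# Example: 8 GPUs drawing ~300 W each for 48 hours
print(training_co2e_kg(avg_power_watts=8 * 300, hours=48))  # ~69 kg CO2e
```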