Machine Learning Systems
Foundations
The Ultra-Scale Playbook: Training LLMs on GPU Clusters
Sasha Rush - Flash LLM Series
Microsoft Research Asia - AI-System
CMU 10-414 Deep Learning Systems
Tie-Yan Liu - Distributed Machine Learning
Seminars & Collections
A Meticulous Guide to Advances in Deep Learning Efficiency over the Years
ML Systems Onboarding Reading List
UW CSE559 Systems for ML Reading List
Andrej Karpathy - llm.c
How to Scale Your Model - A Systems View of LLMs on TPUs
Stanford MLSys Seminar
Machine Learning System Resources
Building Blocks for AI Systems
Open-Source Course on AI Systems
MLSys & ML Compilation
BBuf - How to optimize algorithms in CUDA
BBuf - How to learn deep learning frameworks
CMU 15-442 Machine Learning Systems
UCB CS294 Machine Learning Systems
CMU Machine Learning Compilation
UW CSE599 ML for ML Systems
UW CSE559 Systems for ML
Hardware
Stanford CS217 Hardware Accelerators for Machine Learning
Cornell ECE5545 Machine Learning Hardware and Systems
UCB EE290 Hardware for Machine Learning
MIT 6.5930 Hardware Architecture for Deep Learning
Efficient Deep Learning
MIT 6.5940 TinyML and Efficient Deep Learning Computing
LLM Systems
UCB CS294 Machine Learning Systems (LLM Edition)
CMU 11-868 Large Language Model Systems
UMich EECS598 Systems for Generative AI
UIUC CS598 AI Efficiency: Systems & Algorithms
Blogs & Tutorials
Hardware & Infrastructure
Jeff Dean - Designs, Lessons and Advice from Building Large Distributed Systems
Tim Dettmers - A Full Hardware Guide to Deep Learning
What Every Developer Should Know About GPU Computing
Trends in Machine Learning Hardware
From Bare Metal to a 70B Model: Infrastructure Set-up and Scripts
AI and Memory Wall
The Economics of Generative AI
How To Build A Better “Blackwell” GPU Than Nvidia Did
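A recurring theme in the posts above ("AI and Memory Wall", "Trends in Machine Learning Hardware") is the ratio between an accelerator's compute throughput and its memory bandwidth. A minimal roofline-style sketch, assuming roughly A100-class numbers (~312 TFLOP/s dense BF16, ~2 TB/s HBM) purely for illustration:

```python
# Back-of-envelope roofline check: is a GEMM compute-bound or memory-bound?
# Hardware numbers below are rough A100-80GB-class assumptions; adjust for your GPU.

PEAK_FLOPS = 312e12      # dense BF16 FLOP/s (assumed)
PEAK_BW    = 2.0e12      # HBM bandwidth in bytes/s (assumed)
BYTES_PER_ELEM = 2       # BF16/FP16

def gemm_arithmetic_intensity(m: int, n: int, k: int) -> float:
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n], counting one read/write per matrix."""
    flops = 2 * m * n * k                          # one multiply-accumulate = 2 FLOPs
    bytes_moved = BYTES_PER_ELEM * (m * k + k * n + m * n)
    return flops / bytes_moved

ridge = PEAK_FLOPS / PEAK_BW                       # FLOPs per byte needed to saturate compute

# Decode-time GEMV-like case: one token against a 4096x4096 weight matrix
ai_decode = gemm_arithmetic_intensity(1, 4096, 4096)
# Large training-style square GEMM
ai_train  = gemm_arithmetic_intensity(4096, 4096, 4096)

print(f"ridge point        : {ridge:7.1f} FLOPs/byte")
print(f"decode GEMV (m=1)  : {ai_decode:7.1f} FLOPs/byte -> "
      f"{'compute-bound' if ai_decode > ridge else 'memory-bound'}")
print(f"square 4096^3 GEMM : {ai_train:7.1f} FLOPs/byte -> "
      f"{'compute-bound' if ai_train > ridge else 'memory-bound'}")
```

At these assumed specs the ridge point sits around 150 FLOPs/byte, so a batch-1 decode GEMV (roughly 1 FLOP/byte) is firmly memory-bound while a large square GEMM is compute-bound; that gap is the "memory wall" the hardware posts keep pointing at.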
Training
Huggingface - The Technology Behind BLOOM Training
Lilian Weng - How to Train Really Large Models on Many GPUs?
Andrej Karpathy - Let's reproduce GPT-2 (124M)
Everything about Distributed Training and Efficient Finetuning
DeepSpeed: Advancing MoE inference and training to power next-generation AI scale
DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
Go smol or go home - Why we should train smaller LLMs on more tokens
Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)
A100/H100 Are Too Expensive, So Why Not Use the 4090?
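Much of the training material above (the DeepSpeed ZeRO posts, "Everything about Distributed Training and Efficient Finetuning") rests on one piece of bookkeeping: with mixed-precision Adam, model states cost about 2 + 2 + 12 = 16 bytes per parameter (fp16 weights, fp16 gradients, fp32 master weights plus Adam momentum and variance), and the ZeRO stages differ only in how much of that is sharded across data-parallel ranks. A rough sizing sketch under those assumptions; activations, buffers, and fragmentation are deliberately ignored:

```python
# Rough per-GPU memory for model states under ZeRO, following the 2/2/12 bytes-per-parameter
# accounting from the ZeRO paper (fp16 params, fp16 grads, fp32 Adam states).
# Activation memory, communication buffers, and fragmentation are not counted.

GB = 1024**3

def zero_model_state_gb(n_params: float, dp_degree: int, stage: int) -> float:
    params = 2 * n_params                 # fp16 parameters
    grads  = 2 * n_params                 # fp16 gradients
    optim  = 12 * n_params                # fp32 master copy + Adam momentum + variance
    if stage >= 1:                        # ZeRO-1: shard optimizer states
        optim /= dp_degree
    if stage >= 2:                        # ZeRO-2: also shard gradients
        grads /= dp_degree
    if stage >= 3:                        # ZeRO-3: also shard parameters
        params /= dp_degree
    return (params + grads + optim) / GB

n = 7e9   # a 7B-parameter model, used only as an example
for stage in (0, 1, 2, 3):
    print(f"ZeRO-{stage}, 8-way data parallel: {zero_model_state_gb(n, 8, stage):6.1f} GB/GPU")
```

This is the arithmetic behind why a 7B model's roughly 104 GB of model states cannot fit on a single 80 GB GPU with plain data parallelism, even though its fp16 weights alone are only about 14 GB.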
Inference & Serving
Mistral-AI - Exploring the Latency/Throughput & Cost Space for LLM Inference
Yao Fu - Full Stack Transformer Inference Optimization Season 1: Towards 100x Speedup
Yao Fu - Full Stack Transformer Inference Optimization Season 2: Deploying Long-Context Models
Lilian Weng - Large Transformer Model Inference Optimization
NVIDIA - Mastering LLM Techniques: Inference Optimization
Databricks - LLM Inference Performance Engineering: Best Practices
Anyscale - Reproducible Performance Metrics for LLM inference
Anyscale - How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
How to make LLMs go fast
Where do LLMs spend their FLOPS?
Making Deep Learning Go Brrrr From First Principles
LLM Inference Speed of Light
LLM Inference Series: 1. Introduction
LLM Inference Series: 2. The two-phase process behind LLMs’ responses
LLM Inference Series: 3. KV caching explained
LLM Inference Series: 4. KV caching, a deeper look
LLM Inference Series: 5. Dissecting model performance
Transformer Inference Arithmetic
Dissecting Batching Effects in GPT Inference
Evaluating LLM Deployment Costs
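The KV-caching entries in the list above all start from the same sizing formula: for every prompted or generated token, the cache holds one key and one value vector per layer per KV head. A minimal sketch, assuming a Llama-2-7B-like shape (32 layers, 32 KV heads of dimension 128, fp16, no grouped-query attention) chosen only for illustration:

```python
# KV cache sizing: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element per token,
# multiplied by sequence length and batch size. The model shape below is a Llama-2-7B-like
# assumption, not a statement about any specific checkpoint.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch

per_tok = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)
full    = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=16)

print(f"per token           : {per_tok / 1024:8.0f} KiB")     # ~512 KiB
print(f"batch 16 x 4096 ctx : {full / 1024**3:8.1f} GiB")     # ~32 GiB
```

Roughly 0.5 MiB per token, or about 32 GiB for a batch of 16 sequences at a 4096-token context, on top of roughly 14 GB of fp16 weights; this is why paged and quantized KV caches and continuous batching dominate the serving posts above.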