Transfer Learning and Domain Adaptation in Deep Learning
TL = transfer learning: reuse of a pretrained model across tasks/domains. DA = domain adaptation: the subset of TL that aligns source (src) and target (tgt) distributions.

Fundamentals: deep models learn hierarchical representations; shallow layers detect edges, deep layers capture semantics; TL exploits cross-task similarity. Three-stage TL: 1) pretrain on a src task (e.g., ImageNet), 2) apply a transfer method, 3) fine-tune on the tgt task. DA addresses P_src(X,Y) ≠ P_tgt(X,Y). Types: inductive (tgt labels available), transductive (labeled src, unlabeled tgt), unsupervised DA (no tgt labels).

Key concepts (code sketches for several of these follow at the end of this section): feature reuse: shallow and middle layers transfer better than deep ones; representation alignment: minimize D(P_src(Z), P_tgt(Z)); adversarial DA: a discriminator D is trained to distinguish src from tgt features while the feature extractor F is trained to fool D; MMD (maximum mean discrepancy): penalize divergence between feature distributions; CORAL: align second-order statistics (covariances); entropy minimization: encourage confident predictions on tgt data; self-training: pseudo-label high-confidence tgt samples, then retrain; no-free-lunch: TL fails if src and tgt are too dissimilar; negative transfer: transfer performs worse than random initialization.

Practical applications: medical imaging (src: natural images, tgt: X-rays), NLP (BERT → domain-specific QA), autonomous driving (simulation → real).

SOTA: DINOv2: self-supervised ViT pretrained on diverse images, enables strong zero-shot DA; CDAN: conditional DANN, aligns joint feature/label distributions; SHOT: source-free adaptation via information maximization and self-supervised fine-tuning, no src data needed; AdaMatch: unifies consistency regularization and DA, boosts semi-supervised TL; TENT: test-time entropy minimization, adapts BatchNorm statistics online.

Challenges: modality gap (image → text), label shift (P(Y) differs), concept drift (P(Y|X) evolves over time).

Common pitfalls: overfitting the tgt set (use regularization: dropout, weight decay); small tgt set (freeze early layers); domain collapse (F maps everything to the same representation); forgetting src knowledge (use elastic weight consolidation, EWC); debugging: monitor src/tgt accuracy, t-SNE of features, gradient norms.

Implementation (see the fine-tuning sketch below): use HuggingFace (NLP) or TorchVision (CV); freeze the convolutional layers, replace the classifier head, use a low learning rate (~1e-5), and unfreeze gradually.

Metrics: tgt accuracy, H-score (a balance-oriented DA metric), Δacc (improvement over training without TL).

Future: foundation models (e.g., CLIP, Flamingo) as universal src models; emergent DA via prompting; dynamic architecture adaptation.

Key insight: TL is not just weight initialization; it transfers semantic priors. DA success depends on transferable features and alignment strength. Rule of thumb: if the src/tgt gap |P_src − P_tgt| < ε, fine-tuning suffices; otherwise use explicit DA. Optimal strategy is a hybrid: self-supervised pretraining + adversarial alignment + self-training.
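A minimal fine-tuning sketch of the recipe above (frozen backbone, new head, low learning rate, gradual unfreezing), assuming a TorchVision ResNet-50 and a hypothetical 10-class target task; data loading and the training loop are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_TARGET_CLASSES = 10  # hypothetical target-task size

# 1) Source pretraining: reuse ImageNet weights shipped with TorchVision.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# 2) Freeze the convolutional backbone so early/mid-level features are reused as-is.
for param in model.parameters():
    param.requires_grad = False

# 3) Replace the classifier head with a fresh, trainable layer for the target task.
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)

# 4) Optimize only the new head, with a low learning rate and weight decay.
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-5, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised step on labeled target data."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Gradual unfreezing: once the head has converged, unfreeze the last block too,
# typically with an even smaller learning rate for the newly unfrozen layers.
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer.add_param_group({"params": model.layer4.parameters(), "lr": 1e-6})
```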
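A sketch of an RBF-kernel MMD penalty between batches of source and target features, as one instance of the representation-alignment losses listed above; the bandwidth `sigma` and the weighting of the penalty in the total loss are assumptions.

```python
import torch

def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise RBF kernel k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd_loss(src_feats: torch.Tensor, tgt_feats: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD between source and target feature distributions."""
    k_ss = rbf_kernel(src_feats, src_feats, sigma).mean()
    k_tt = rbf_kernel(tgt_feats, tgt_feats, sigma).mean()
    k_st = rbf_kernel(src_feats, tgt_feats, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

# Typical use: total_loss = task_loss_on_src + lambda_mmd * mmd_loss(z_src, z_tgt)
```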
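A sketch of a Deep-CORAL-style loss aligning second-order statistics: the squared Frobenius distance between source and target feature covariances, computed per batch.

```python
import torch

def coral_loss(src_feats: torch.Tensor, tgt_feats: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius distance between source and target feature covariances."""
    d = src_feats.size(1)

    def covariance(z: torch.Tensor) -> torch.Tensor:
        # Unbiased covariance of a (batch, features) matrix.
        z = z - z.mean(dim=0, keepdim=True)
        return (z.t() @ z) / (z.size(0) - 1)

    diff = covariance(src_feats) - covariance(tgt_feats)
    return (diff ** 2).sum() / (4.0 * d * d)
```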
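A sketch of adversarial alignment via a gradient-reversal layer: the discriminator D learns to separate source from target features while the reversed gradient pushes the feature extractor F to fool it. The feature dimension and discriminator architecture here are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

FEAT_DIM = 256  # hypothetical dimension of the features produced by F

domain_discriminator = nn.Sequential(
    nn.Linear(FEAT_DIM, 128), nn.ReLU(),
    nn.Linear(128, 2),  # class 0 = source, class 1 = target
)
domain_criterion = nn.CrossEntropyLoss()

def adversarial_domain_loss(src_feats: torch.Tensor,
                            tgt_feats: torch.Tensor,
                            lambd: float = 1.0) -> torch.Tensor:
    """D minimizes this loss; via the reversed gradient, F is trained to maximize it."""
    feats = torch.cat([src_feats, tgt_feats], dim=0)
    domains = torch.cat([
        torch.zeros(src_feats.size(0), dtype=torch.long, device=feats.device),
        torch.ones(tgt_feats.size(0), dtype=torch.long, device=feats.device),
    ])
    logits = domain_discriminator(grad_reverse(feats, lambd))
    return domain_criterion(logits, domains)
```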
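A sketch of the self-training step: pseudo-label target samples and keep only those above a confidence threshold (0.9 here is an assumed value), then mix them into the labeled pool and retrain.

```python
import torch

@torch.no_grad()
def pseudo_label(model: torch.nn.Module,
                 tgt_images: torch.Tensor,
                 threshold: float = 0.9):
    """Return the target samples, and their predicted labels, that the model is confident about."""
    model.eval()
    probs = torch.softmax(model(tgt_images), dim=1)
    confidence, labels = probs.max(dim=1)
    keep = confidence >= threshold
    return tgt_images[keep], labels[keep]

# The retained (image, pseudo-label) pairs are added to the training data and the
# model is retrained; the label/retrain cycle can be repeated for several rounds.
```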
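A sketch in the spirit of TENT for test-time adaptation: only the BatchNorm affine parameters are updated by minimizing prediction entropy, and BN layers use the statistics of the incoming test batch. Optimizer choice and episodic resetting are assumptions left out here.

```python
import torch
import torch.nn as nn

def configure_for_tent(model: nn.Module):
    """Freeze all weights except BatchNorm affine parameters; make BN use batch statistics."""
    model.train()  # BatchNorm computes batch statistics in train mode
    for p in model.parameters():
        p.requires_grad = False
    adapt_params = []
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
            module.track_running_stats = False
            module.running_mean, module.running_var = None, None
            if module.affine:
                module.weight.requires_grad = True
                module.bias.requires_grad = True
                adapt_params += [module.weight, module.bias]
    return adapt_params

def entropy_adapt_step(model: nn.Module,
                       optimizer: torch.optim.Optimizer,
                       test_batch: torch.Tensor) -> torch.Tensor:
    """One online adaptation step: minimize the mean prediction entropy on the batch."""
    logits = model(test_batch)
    probs = torch.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return logits.detach()

# Usage sketch: optimizer = torch.optim.SGD(configure_for_tent(model), lr=1e-3)
```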
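A sketch of an EWC-style penalty to reduce forgetting of source knowledge during target fine-tuning; the parameter snapshot, the diagonal Fisher-information estimate, and the weight `lamb` are assumptions, and their computation is omitted.

```python
import torch
import torch.nn as nn

def ewc_penalty(model: nn.Module,
                src_params: dict,   # snapshot of parameters after source training, by name
                fisher: dict,       # diagonal Fisher-information estimate per parameter, by name
                lamb: float = 10.0):
    """Quadratic penalty keeping important parameters close to their source-task values."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - src_params[name]) ** 2).sum()
    return 0.5 * lamb * penalty

# Typical use during target fine-tuning:
#   loss = task_loss + ewc_penalty(model, src_params, fisher)
```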