Introduction

Tokenization: raw text is split into discrete units (tokens), typically subwords from a learned vocabulary such as byte-pair encoding (BPE), and each token is mapped to an integer ID the model consumes. A minimal sketch follows.
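
Below is a minimal, self-contained word-level tokenizer in Python. The vocabulary, the <unk> token, and the function names are toy assumptions for illustration; real LLM tokenizers use learned subword vocabularies with tens of thousands of entries.

```python
# Minimal word-level tokenizer sketch. The vocabulary and special token
# here are toy assumptions; production tokenizers learn subword merges
# (e.g. BPE) from the training corpus.
UNK = "<unk>"
VOCAB = [UNK, "the", "cat", "sat", "on", "mat"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each word to its integer ID."""
    return [TOKEN_TO_ID.get(w, TOKEN_TO_ID[UNK]) for w in text.lower().split()]

def detokenize(ids: list[int]) -> str:
    """Map IDs back to tokens and rejoin them."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

print(tokenize("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
print(detokenize([1, 2, 3]))               # "the cat sat"
```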

Workflow:

  1. Collect large text corpus
  2. Tokenize text → sequences of token IDs
  3. Pretrain transformer on next-token (causal) or masked-token prediction (see the sketch after this list)
  4. Fine-tune on task-specific datasets
  5. Deploy for inference → text generation / Q&A / summarization
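
As a complement to steps 3 and 5, here is a toy PyTorch sketch that pretrains a tiny causal transformer on random token IDs with next-token cross-entropy, then decodes greedily at inference. TinyLM, the sizes, and the random batch are illustrative assumptions, not a prescribed recipe.

```python
# Toy sketch of steps 3 and 5: pretrain a tiny causal transformer with
# next-token cross-entropy, then decode greedily. All sizes and data
# here are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE, DIM, CONTEXT = 1000, 32, 8

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, DIM)   # token embeddings
        self.pos = nn.Embedding(CONTEXT, DIM)      # learned positions
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB_SIZE)     # logits over the vocabulary

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.tok(ids) + self.pos(pos)
        # Causal mask: position t may attend only to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.blocks(h, mask=mask))

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# One pretraining step on a random toy batch: position t predicts token t+1.
batch = torch.randint(0, VOCAB_SIZE, (4, CONTEXT + 1))
inputs, targets = batch[:, :-1], batch[:, 1:]
loss = loss_fn(model(inputs).reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()
opt.step()

# Greedy decoding at inference: repeatedly append the most likely next token.
@torch.no_grad()
def generate(model, ids, n_new):
    model.eval()  # disable dropout for deterministic decoding
    for _ in range(n_new):
        next_id = model(ids)[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

print(generate(model, torch.tensor([[1, 2]]), n_new=5))
```

Fine-tuning (step 4) reuses the same loss and training loop, just with task-specific input/target pairs in place of raw corpus batches.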