Abstract
We study whether base (non-instruction-tuned) genomic language models (gLMs) exhibit in-context learning (ICL) on DNA. Using an adapted NanoGPT trained on multiple Escherichia coli references with BPE tokenization, we frame promoter completion as a genomic ICL task: a 1,000 bp upstream context (prompt) conditions autoregressive generation of downstream bases. We introduce an intrinsic evaluation suite that quantifies overall, compositional, structural, and local consistency similarity between generated and ground-truth promoter sequences, alongside loss and GC% diagnostics. Preliminary results suggest the base model learns aggregate nucleotide patterns and motif ordering signals, while position-wise fidelity remains limited. We discuss tokenization–compression trade-offs, scaling behavior, and cross-species transfer directions for evaluating emergent behavior in genomic models.
Research Track: CSCI-RTCB
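The abstract does not specify how the compositional similarity and GC% diagnostics are computed; as an illustrative sketch only (not the paper's actual evaluation suite), the Python snippet below shows one plausible way to compare a generated promoter completion against its ground-truth counterpart using GC content and a k-mer cosine similarity. The example sequences and the choice of k = 3 are hypothetical.

```python
# Illustrative sketch (assumption, not the paper's suite): GC% diagnostic and a
# simple k-mer compositional similarity between a generated promoter completion
# and the ground-truth downstream sequence.
from collections import Counter
from math import sqrt


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers (e.g., trinucleotides) in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_cosine_similarity(a: str, b: str, k: int = 3) -> float:
    """Cosine similarity of k-mer count vectors; 1.0 means identical composition."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    dot = sum(pa[m] * pb[m] for m in pa)
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    return dot / norm if norm else 0.0


# Hypothetical example: compare a generated completion to the reference promoter.
generated = "ATGCGTTAGCCGATTACGGCTAGCTAGGCT"
reference = "ATGCGTAAGCCGATTACGGCTAGCTTGGCT"
print(f"GC% generated: {gc_content(generated):.2%}, reference: {gc_content(reference):.2%}")
print(f"3-mer compositional similarity: {kmer_cosine_similarity(generated, reference):.3f}")
```

A suite like the one described would add structural and local consistency measures on top of such compositional checks, but those details are not given in the abstract.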
Publication Info
- Year: 2025
- Type: article
Identifiers
- DOI: 10.21203/rs.3.rs-8287395/v1