Abstract

We study whether base (non-instruction-tuned) genomic language models (gLMs) exhibit in-context learning (ICL) on DNA. Using an adapted NanoGPT trained on multiple Escherichia coli references with BPE tokenization, we frame promoter completion as a genomic ICL task: a 1,000 bp upstream context (prompt) conditions autoregressive generation of downstream bases. We introduce an intrinsic evaluation suite that quantifies overall, compositional, structural, and local-consistency similarity between generated and ground-truth promoter sequences, alongside loss and GC% diagnostics. Preliminary results suggest the base model learns aggregate nucleotide patterns and motif-ordering signals, while position-wise fidelity remains limited. We discuss tokenization–compression trade-offs, scaling behavior, and cross-species transfer directions for evaluating emergent behavior in genomic models.

Research Track: CSCI-RTCB
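
The paper's intrinsic evaluation suite is only summarized above; as a minimal, hypothetical sketch of the flavor of its compositional and GC% diagnostics, the Python below computes GC content and a k-mer cosine similarity between a generated completion and its ground-truth promoter. The function names (gc_percent, kmer_profile, kmer_cosine) and the choice of k-mer cosine as the compositional measure are illustrative assumptions, not the authors' actual implementation.

```python
from collections import Counter

def gc_percent(seq: str) -> float:
    """Fraction of G/C bases in a DNA string, as a percentage."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Counts of overlapping k-mers (a simple compositional summary)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_cosine(a: str, b: str, k: int = 3) -> float:
    """Cosine similarity between the two sequences' k-mer count vectors."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    dot = sum(pa[m] * pb[m] for m in pa)
    na = sum(v * v for v in pa.values()) ** 0.5
    nb = sum(v * v for v in pb.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Toy example: compare a hypothetical generated completion to ground truth.
generated = "TATAATGCCGCGTTAACGT"
truth = "TATAATGCAGCGTTAACGA"
print(f"GC% (generated vs. truth): {gc_percent(generated):.1f} vs. {gc_percent(truth):.1f}")
print(f"3-mer cosine similarity: {kmer_cosine(generated, truth):.3f}")
```

Metrics like these capture aggregate composition rather than position-wise agreement, which is consistent with the abstract's distinction between aggregate nucleotide patterns and limited per-position fidelity.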

Publication Info

Year: 2025
Type: article
Citations: 0 (OpenAlex)
Access: Closed

Cite This

Aadit Kapoor, Wendy Lee (2025). In-Context Learning in Genomic Language Models as a Biological Evaluation Task. https://doi.org/10.21203/rs.3.rs-8287395/v1

Identifiers

DOI: 10.21203/rs.3.rs-8287395/v1