Corpus control plane
Operate the ingestion, normalization, tokenization, and embedding pipeline. Inspect quality at the row level, traverse the semantic graph, and ship reports.
sources: 0
tokens: 0
graph nodes: 0
validated: 0%
report preview
PDF — corpus.pdf
- Range: May 05, 2026, 03:10 AM → Jun 04, 2026, 03:10 AM
- Sources: all
- Vectors: excluded
- Audit: included
0 sources
Last run: 2h ago (Jun 04, 2026, 01:10 AM)
Next run: in 4h (Jun 04, 2026, 07:10 AM)
run history
Last 14 runs
ok warn err
May 22, 2026, 03:10 AM → Jun 04, 2026, 03:10 AM · cron: 0 */6 * * *
4/6 steps active
pipeline
Default normalization
- 1. Strip HTML & scripts
- 2. Unicode NFKC normalization
- 3. Collapse whitespace
- 4. Near-duplicate removal (MinHash)
- 5. PII detection & masking
- 6. Language filter
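The first three steps above can be approximated with the Python standard library. This is a minimal sketch, not the pipeline's actual implementation: the names `strip_html` and `normalize` are hypothetical, and steps 4 to 6 (MinHash deduplication, PII masking, language filtering) are left out because they need extra libraries or models.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(raw: str) -> str:
    """Step 1: drop tags and script/style bodies, keep visible text."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps block-level elements from fusing together;
    # it can also split a word that spans inline tags (assumed acceptable here).
    return " ".join(parser.parts)


def normalize(raw: str) -> str:
    text = strip_html(raw)                      # step 1
    text = unicodedata.normalize("NFKC", text)  # step 2
    return re.sub(r"\s+", " ", text).strip()    # step 3
```

Running NFKC before the whitespace collapse means compatibility characters such as the non-breaking space are already folded to a plain space when step 3 runs.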
exploratory analysis
Field summary
Max tokens: 8,192
Chunk size: 512 tok
Overlap: 64 tok
live preview
9 chunks · BPE
#1 Transformer architectures rely o
#2 ly on attention mechanisms to mo
#3 o model relationships between to
#4 n tokens across long contexts. I
#5 s. In production, chunking and o
#6 nd overlap parameters strongly a
#7 ly affect retrieval quality and
#8 and the cost of embedding genera
#9 neration.
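The overlapping chunks in the preview follow a sliding-window pattern: each chunk starts `chunk size - overlap` tokens after the previous one. The sketch below illustrates that idea with the values shown above (512 and 64) as defaults. The function name `chunk_tokens` is hypothetical, and it takes an already-tokenized list, so the BPE tokenizer itself is assumed to exist elsewhere.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    stride = chunk_size - overlap  # distance between chunk starts
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input, so no
        # trailing chunk is emitted that only repeats overlap tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

A larger overlap repeats more context across chunk boundaries, which tends to help retrieval but raises the number of chunks and therefore the number of embeddings to generate.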
3 of 4 enabled
TF-IDF + n-grams (Text)
Sparse lexical features with configurable n-gram range
Numeric & categorical statistics (Tabular)
Z-score normalization, target encoding, missingness flags
Graph features (Graph)
PageRank, betweenness, community embedding
Embedding pooling (Embeddings)
Mean / max / attention pooling for sequence embeddings
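As an illustration of the three pooling modes named above, here is a dependency-free sketch that reduces a sequence of token embeddings to one vector. The function names are hypothetical, and the attention variant assumes a single query vector scored by dot product followed by a softmax, which is one common formulation and not necessarily the one this pipeline uses.

```python
import math


def mean_pool(vectors):
    """Average each dimension across the sequence."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def max_pool(vectors):
    """Take the per-dimension maximum across the sequence."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [max(col) for col in zip(*vectors)]


def attention_pool(vectors, query):
    """Softmax-weighted sum; weights come from dot products with `query`."""
    if not vectors:
        raise ValueError("need at least one vector")
    scores = [sum(a * b for a, b in zip(v, query)) for v in vectors]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dims = len(vectors[0])
    return [
        sum(w * v[d] for w, v in zip(weights, vectors)) / total
        for d in range(dims)
    ]
```

With an all-zero query every token gets the same weight, so attention pooling reduces to mean pooling, which makes a convenient sanity check.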
0 / 0 rows
Created — date range
Min confidence: 0.00