knowledge / corpus

Corpus control plane

Operate the ingestion, normalization, tokenization, and embedding pipeline. Inspect quality at the row level, traverse the semantic graph, and ship reports.

sources: 0
tokens: 0
graph nodes: 0
validated: 0%
01 · output

Report generator

Compose a corpus snapshot scoped to time range, sources, and format. This is a frontend stub; wire it to your reporting service.

report preview
PDF — corpus.pdf
Range: May 05, 2026, 03:10 AM to Jun 04, 2026, 03:10 AM
Sources: all
Vectors: excluded
Audit: included
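
Because the generator is a stub, one way to wire it up is to turn the preview fields into a request payload. The sketch below is only an illustration: the `ReportRequest` class, its field names, and the JSON shape are all hypothetical, and the values simply mirror the preview above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass
class ReportRequest:
    # Every field name here is hypothetical; the page only shows the labels.
    output_format: str
    filename: str
    range_start: str
    range_end: str
    sources: str
    include_vectors: bool
    include_audit: bool


def build_request(start: datetime, end: datetime) -> ReportRequest:
    """Build a request from the values shown in the report preview."""
    if end <= start:
        raise ValueError("range end must be after range start")
    return ReportRequest(
        output_format="pdf",
        filename="corpus.pdf",
        range_start=start.isoformat(),
        range_end=end.isoformat(),
        sources="all",
        include_vectors=False,  # preview shows "Vectors: excluded"
        include_audit=True,     # preview shows "Audit: included"
    )


request = build_request(datetime(2026, 5, 5, 3, 10), datetime(2026, 6, 4, 3, 10))
payload = json.dumps(asdict(request), indent=2)
```

The resulting `payload` string is what a reporting service endpoint would receive; the transport (REST, websocket, or otherwise) is left open here.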
02 · ingestion

Source management

Validate, block, and inspect ingestion sources. Inline actions trigger pipeline events; metadata opens in a side panel.

0 sources
03 · automation

Scheduler

Cron-driven ingestion. Toggle auto-update, fire manual runs, or inspect the recent timeline.

Last run: 2h ago (Jun 04, 2026, 01:10 AM)
Next run: in 4h (Jun 04, 2026, 07:10 AM)
run history
Last 14 runs
Legend: ok · warn · err
Timeline: May 22, 2026, 03:10 AM to Jun 04, 2026, 03:10 AM · cron · 0 */6 * * *
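
To make the schedule concrete, the following sketch computes upcoming fire times for the expression `0 */6 * * *`. It is a deliberately small matcher, not a full cron parser: it handles only `*`, `*/n`, and plain integers in the minute and hour fields, and the function name `next_runs` is an assumption for illustration.

```python
from datetime import datetime, timedelta


def _matches(field: str, value: int) -> bool:
    """Match one cron field; supports '*', '*/n', and a plain integer only."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def next_runs(expr: str, after: datetime, count: int) -> list[datetime]:
    """Return the next `count` fire times strictly after `after`."""
    minute, hour, dom, month, dow = expr.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only supports '*' for day and month fields")
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    limit = after + timedelta(days=366)  # guard against expressions that never match
    runs: list[datetime] = []
    while len(runs) < count:
        if t > limit:
            raise ValueError("expression did not match within one year")
        if _matches(minute, t.minute) and _matches(hour, t.hour):
            runs.append(t)
        t += timedelta(minutes=1)
    return runs


runs = next_runs("0 */6 * * *", datetime(2026, 6, 4, 3, 10), 3)
```

Under standard cron semantics this expression fires at minute 0 of hours 0, 6, 12, and 18. The sample timestamps on this page (01:10, 07:10) do not fall on those boundaries, which suggests they are placeholder values.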
04 · pre-processing

Normalization & EDA

Toggle pipeline steps and inspect distributions before they hit feature engineering.

4/6 steps active
pipeline
Default normalization
  • 1. Strip HTML & scripts
  • 2. Unicode NFKC normalization
  • 3. Collapse whitespace
  • 4. Near-duplicate removal (MinHash)
  • 5. PII detection & masking
  • 6. Language filter
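
The first three steps of the pipeline above can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the regex-based HTML stripping is a simplification (a production pipeline would more likely use a real HTML parser), and the step names and the `enabled` toggle mapping are hypothetical stand-ins for the UI switches.

```python
import html
import re
import unicodedata

# Crude patterns for illustration; they are not a substitute for an HTML parser.
_SCRIPT_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def strip_html(text: str) -> str:
    """Drop script/style blocks, remove remaining tags, decode entities."""
    text = _SCRIPT_RE.sub(" ", text)
    text = _TAG_RE.sub(" ", text)
    return html.unescape(text)


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (folds ligatures, width variants, etc.)."""
    return unicodedata.normalize("NFKC", text)


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim the ends."""
    return _WS_RE.sub(" ", text).strip()


# Ordered like the pipeline list; names are hypothetical toggle keys.
STEPS = [
    ("strip_html", strip_html),
    ("nfkc", nfkc),
    ("collapse_whitespace", collapse_whitespace),
]


def normalize(text: str, enabled: dict[str, bool]) -> str:
    """Run each step in order unless its toggle is explicitly switched off."""
    for name, step in STEPS:
        if enabled.get(name, True):
            text = step(text)
    return text


cleaned = normalize("<p>\ufb01ne&nbsp;&amp;  dandy</p><script>alert(1)</script>", {})
```

Order matters here: entities are decoded before NFKC so that a decoded non-breaking space is folded into a regular space, and whitespace is collapsed last so that gaps left by removed tags disappear.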
exploratory analysis
Field summary
05 · tokenization

Tokenizer configuration

Select strategy and tune chunk parameters. Preview updates as you adjust.

Max tokens: 8,192
Chunk size: 512 tok
Overlap: 64 tok
live preview
9 chunks · BPE
512 tok · 64 overlap
#1 Transformer architectures rely o
#2 ly on attention mechanisms to mo
#3 o model relationships between to
#4 n tokens across long contexts. I
#5 s. In production, chunking and o
#6 nd overlap parameters strongly a
#7 ly affect retrieval quality and
#8 and the cost of embedding genera
#9 neration.
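
The chunk size and overlap settings describe a sliding window whose stride is `size - overlap`. The sketch below implements that window over any sequence (a token list or a string); the function name `chunk` is an assumption. The preview rows above are consistent with a 32-character window and a 4-character overlap, which is the configured 512 / 64 scaled down by 16. That is an observation about the displayed rows, not a claim about how the widget is implemented.

```python
from typing import Sequence, TypeVar

S = TypeVar("S", bound=Sequence)


def chunk(seq: S, size: int, overlap: int) -> list[S]:
    """Split `seq` into windows of `size` items that share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap
    chunks: list[S] = []
    start = 0
    while start < len(seq):
        chunks.append(seq[start:start + size])
        if start + size >= len(seq):
            break  # this window already reaches the end of the input
        start += stride
    return chunks


# Preview text, stitched back together from the overlapping rows shown above.
TEXT = (
    "Transformer architectures rely on attention mechanisms to model "
    "relationships between tokens across long contexts. In production, "
    "chunking and overlap parameters strongly affect retrieval quality "
    "and the cost of embedding generation."
)

preview = chunk(TEXT, size=32, overlap=4)                  # character-level, as in the preview
full = chunk(list(range(8192)), size=512, overlap=64)      # token-level, at the configured limits
```

With the configured values the stride is 448 tokens, so an input at the 8,192-token maximum yields 19 chunks, the last one holding the remaining 128 tokens.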
06 · features

Feature engineering

Compose feature transforms across text, tabular, graph, and embedding modalities.

3 of 4 enabled
TF-IDF + n-grams · Text

Sparse lexical features with configurable n-gram range

Numeric & categorical statistics · Tabular

Z-score normalization, target encoding, missingness flags

Graph features · Graph

PageRank, betweenness, community embedding

Embedding pooling · Embeddings

Mean / max / attention pooling for sequence embeddings
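
As an illustration of the last transform, the three pooling modes reduce a sequence of token embeddings to a single vector. The sketch below uses plain Python lists to stay dependency-free. The function names are assumptions, and the attention variant uses a fixed query vector with dot-product scores, whereas a real model would typically learn that query.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def _check(vectors: Sequence[Vector]) -> None:
    if not vectors:
        raise ValueError("need at least one embedding to pool")


def mean_pool(vectors: Sequence[Vector]) -> list[float]:
    """Average each dimension across the sequence."""
    _check(vectors)
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def max_pool(vectors: Sequence[Vector]) -> list[float]:
    """Take the per-dimension maximum across the sequence."""
    _check(vectors)
    return [max(col) for col in zip(*vectors)]


def attention_pool(vectors: Sequence[Vector], query: Vector) -> list[float]:
    """Weight each embedding by softmax(dot(query, embedding)) and sum."""
    _check(vectors)
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*vectors)]


embeddings = [[1.0, 0.0], [3.0, 4.0]]
pooled_mean = mean_pool(embeddings)
pooled_max = max_pool(embeddings)
pooled_attn = attention_pool(embeddings, query=[0.0, 0.0])
```

A zero query gives every position the same score, so attention pooling reduces to mean pooling in that case; a query aligned with one embedding shifts the result toward it.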

07 · operations

Data views

Switch between row-level token operations and a 3D semantic graph of the corpus.

0 / 0 rows
Created — date range
Min confidence: 0.00
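
The row view's two filters (created date range and minimum confidence) amount to a simple predicate over rows. The sketch below shows one way to express it; the `Row` class and its fields are hypothetical, since the page does not show the row schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Row:
    # Hypothetical row shape; the page does not show the actual schema.
    token: str
    created: date
    confidence: float


def filter_rows(
    rows: list[Row],
    start: Optional[date] = None,
    end: Optional[date] = None,
    min_confidence: float = 0.0,
) -> list[Row]:
    """Keep rows created within [start, end] whose confidence meets the floor."""
    return [
        row
        for row in rows
        if (start is None or row.created >= start)
        and (end is None or row.created <= end)
        and row.confidence >= min_confidence
    ]


rows = [
    Row("attention", date(2026, 5, 10), 0.92),
    Row("overlap", date(2026, 5, 20), 0.40),
    Row("chunk", date(2026, 6, 1), 0.75),
]
kept = filter_rows(rows, start=date(2026, 5, 15), min_confidence=0.5)
counter = f"{len(kept)} / {len(rows)} rows"
```

With the default floor of 0.00 and no date range, every row passes, which matches a counter where the shown and total counts are equal.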
knowledge / corpus · frontend reference architecture
api · websocket · streaming layers pending