Corpus control plane

Operate the ingestion, normalization, tokenization, and embedding pipeline. Inspect quality at the row level, traverse the semantic graph, and ship reports.

sources

tokens

graph nodes

validated

01 · output

Report generator

Compose a corpus snapshot scoped to time range, sources, and format. Frontend stub — wire to your reporting service.

Time range — from

Time range — to

Data sources

all selected

Output format

Include vectors

Adds embedding payloads (significantly larger files)

Include audit trail

Per-token lifecycle events and processing metadata

report preview

PDF — corpus.pdf

Range: May 22, 2026, 12:38 PM → Jun 21, 2026, 12:38 PM
Sources: all
Vectors: excluded
Audit: included
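
Because the generator is described as a frontend stub, the form options could be gathered into a small request object before being handed to a reporting service. The following is a minimal sketch under that assumption: `ReportRequest`, its field names, and `summary_lines` are hypothetical and do not correspond to any existing API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ReportRequest:
    """Hypothetical payload mirroring the report form; not a real API."""
    start: datetime
    end: datetime
    sources: list[str] = field(default_factory=list)  # empty means "all"
    output_format: str = "pdf"
    include_vectors: bool = False
    include_audit: bool = True

    def __post_init__(self) -> None:
        # Reject an inverted time range early, before any service call.
        if self.start > self.end:
            raise ValueError("start must not be later than end")

    def summary_lines(self) -> list[str]:
        """Render the same four lines the preview panel shows."""
        fmt = "%b %d, %Y, %I:%M %p"
        sources = ", ".join(self.sources) if self.sources else "all"
        vectors = "included" if self.include_vectors else "excluded"
        audit = "included" if self.include_audit else "excluded"
        return [
            f"Range: {self.start.strftime(fmt)} → {self.end.strftime(fmt)}",
            f"Sources: {sources}",
            f"Vectors: {vectors}",
            f"Audit: {audit}",
        ]
```

Keeping the preview text derived from the same object that would be submitted means the preview cannot drift from the request.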

02 · ingestion

Source management

Validate, block, and inspect ingestion sources. Inline actions trigger pipeline events; metadata opens in a side panel.

0 sources

03 · automation

Scheduler

Cron-driven ingestion. Toggle auto-update, fire manual runs, or inspect the recent timeline.

Cron expression

UTC · 5-field

Auto-update enabled

When off, only manual runs are executed

Last run

2h ago

Jun 21, 2026, 10:38 AM

Next run

in 4h

Jun 21, 2026, 04:38 PM

run history

Last 14 runs

ok warn err

Jun 08, 2026, 12:38 PM → Jun 21, 2026, 12:38 PM

cron · 0 */6 * * *
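
To illustrate how a 5-field UTC expression such as `0 */6 * * *` resolves to a next run time, here is a minimal sketch. It is not the scheduler's actual implementation: `expand` and `next_run` are hypothetical helpers, only `*`, `*/n`, single values, ranges, and comma lists are handled, and day-of-month and day-of-week are combined with AND, which is a simplification of standard cron behavior.

```python
from datetime import datetime, timedelta, timezone

# (min, max) bounds for minute, hour, day-of-month, month, day-of-week
BOUNDS = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand(field: str, lo: int, hi: int) -> set[int]:
    """Expand one cron field into the set of values it matches."""
    values: set[int] = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            a, b = base.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(base)
        values.update(range(start, end + 1, int(step) if step else 1))
    return values


def next_run(expr: str, after: datetime) -> datetime:
    """Return the first matching UTC minute strictly after `after`.

    `after` should be timezone-aware; naive values are read as local time.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected a 5-field cron expression")
    minute, hour, dom, month, dow = (
        expand(f, lo, hi) for f, (lo, hi) in zip(fields, BOUNDS)
    )
    t = after.astimezone(timezone.utc).replace(second=0, microsecond=0)
    # Brute-force scan, one minute at a time, bounded to roughly one year.
    for _ in range(366 * 24 * 60):
        t += timedelta(minutes=1)
        cron_dow = (t.weekday() + 1) % 7  # cron convention: 0 = Sunday
        if (t.minute in minute and t.hour in hour and t.day in dom
                and t.month in month and cron_dow in dow):
            return t
    raise ValueError("no matching time within one year")
```

Under these assumptions, `*/6` in the hour field expands to hours 0, 6, 12, and 18, so the expression fires at minute 0 of those hours in UTC.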

04 · pre-processing

Normalization & EDA

Toggle pipeline steps and inspect field distributions before the data reaches feature engineering.

4/6 steps active

pipeline

Default normalization

1. Strip HTML & scripts
2. Unicode NFKC normalization
3. Collapse whitespace
4. Near-duplicate removal (MinHash)
5. PII detection & masking
6. Language filter
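
The first three steps in the list above can be sketched as a chain of toggleable functions. This is an illustrative sketch only: the function names and the `STEPS` list are assumptions, the regex-based tag stripping is a crude stand-in for a real HTML parser, and steps 4 to 6 (MinHash deduplication, PII masking, language filtering) are left out.

```python
import html
import re
import unicodedata

# Remove <script> and <style> elements together with their contents.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def strip_html(text: str) -> str:
    """Step 1: drop script/style blocks and tags, then decode entities."""
    text = SCRIPT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return html.unescape(text)


def nfkc(text: str) -> str:
    """Step 2: apply Unicode NFKC (compatibility composition)."""
    return unicodedata.normalize("NFKC", text)


def collapse_whitespace(text: str) -> str:
    """Step 3: squeeze whitespace runs into single spaces and trim."""
    return WS_RE.sub(" ", text).strip()


# Hypothetical default order; disabling a step means removing it here.
STEPS = [strip_html, nfkc, collapse_whitespace]


def normalize(text: str, steps=STEPS) -> str:
    for step in steps:
        text = step(text)
    return text
```

Representing the pipeline as an ordered list makes each toggle a matter of including or excluding one function, which keeps the order of the remaining steps explicit.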

exploratory analysis

Field summary

05 · tokenization

Tokenizer configuration

Select strategy and tune chunk parameters. Preview updates as you adjust.

Tokenizer

Max tokens: 8,192

Chunk size: 512 tok

Overlap: 64 tok

Strip HTML

Lowercase

Preserve code blocks

live preview

9 chunks · BPE

512 tok · 64 overlap

#1 Transformer architectures rely o

#2 ly on attention mechanisms to mo

#3 o model relationships between to

#4 n tokens across long contexts. I

#5 s. In production, chunking and o

#6 nd overlap parameters strongly a

#7 ly affect retrieval quality and

#8 and the cost of embedding genera

#9 neration.
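
The chunk size and overlap settings describe a sliding window: each chunk starts `chunk_size - overlap` tokens after the previous one, so consecutive chunks share `overlap` tokens. The sketch below shows that windowing on an already tokenized sequence. `chunk_tokens` is a hypothetical helper, and the tokenizer itself (BPE in the preview) is assumed to have run beforehand.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int = 512,
                 overlap: int = 64) -> list[Sequence[T]]:
    """Split `tokens` into windows of `chunk_size` sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence, so no
        # trailing chunk made only of already-covered tokens is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, a 1,000-token input yields three chunks starting at offsets 0, 448, and 896. A larger overlap raises the chunk count, and with it the number of embeddings to generate.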

06 · features

Feature engineering

Compose feature transforms across text, tabular, graph, and embedding modalities.

3 of 4 enabled

TF-IDF + n-grams · Text

Sparse lexical features with configurable n-gram range

Numeric & categorical statistics · Tabular

Z-score normalization, target encoding, missingness flags

Graph features · Graph

PageRank, betweenness, community embedding

Embedding pooling · Embeddings

Mean / max / attention pooling for sequence embeddings
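
As an illustration of the embedding pooling transform, the sketch below reduces a sequence of per-token vectors to a single vector in three ways. It uses plain Python lists for clarity. The function names are hypothetical, and `attention_pool` assumes the per-token scores are supplied by the caller, whereas in a real model they would typically be learned.

```python
import math


def _check(vectors: list[list[float]]) -> None:
    if not vectors:
        raise ValueError("need at least one vector")


def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average each dimension across the sequence."""
    _check(vectors)
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def max_pool(vectors: list[list[float]]) -> list[float]:
    """Take the largest value in each dimension."""
    _check(vectors)
    return [max(col) for col in zip(*vectors)]


def attention_pool(vectors: list[list[float]],
                   scores: list[float]) -> list[float]:
    """Weighted sum of vectors with softmax-normalized scores as weights."""
    _check(vectors)
    if len(scores) != len(vectors):
        raise ValueError("need one score per vector")
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*vectors)]
```

When all scores are equal, the softmax weights are uniform and attention pooling reduces to mean pooling.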

07 · operations

Data views

Switch between row-level token operations and a 3D semantic graph of the corpus.

0 / 0 rows

Sources: all

Validation status

Embedding model

Created — date range

Min confidence: 0.00