
export-to-sqlite

A Clojure library for turning data dumps (usually CSV) into SQLite databases. It is intentionally dataset-agnostic: you supply schema hints and overrides, and the library handles inference, table creation, and loading.

Goals

  • Accept CSV (and similar tabular dumps) and load them into SQLite.
  • Provide inspectable schema inference that can be overridden by humans or LLMs.
  • Make “mostly correct” imports easy while keeping the inevitable ambiguity visible.

Quick Start

```clojure
(require '[com.latacora.export-to-sqlite.core :as ets])

(ets/apply-spec!
 {:db-path "export.sqlite"
  :spec
  {:tables
   [{:name "stations"
     :path "stations.csv"
     :primary-key ["station_id"]
     :columns [{:name "station_id" :type :integer}
               {:name "name" :type :text}
               {:name "is_operational" :type :boolean
                :boolean-true #{"1"} :boolean-false #{"0"}}
               {:name "opened_on" :type :date}
               {:name "tags" :type :text}]
     :indexes [{:columns ["station_id"]}
               {:columns ["name"] :collations {"name" "NOCASE"}}]
     :denorm [{:source-column "tags"
               :table "station_tags"
               :value-column "tag"
               :key-columns ["station_id"]
               :split ","}]}
    {:name "routes" :path "routes.csv"}]}})
```

Specs are plain data — easy to review, version, and share. Use Spec Inference to generate a starting point during development, then check the result into your code.

Spec Inference

spec/infer-spec uses heuristics (column names, sampled data, cardinality) to propose types, indexes, primary keys, foreign keys, and denormalization specs. It is a development-time tool — run it once to generate a starting point, review and adjust the output, then make the spec explicit in your code. The heuristics are not guaranteed stable across library versions, so code that calls infer-spec at runtime may see different results after a library upgrade.

Recommended workflow:

  1. Infer — call infer-spec at the REPL or in a scratch namespace
  2. Review — inspect the result (especially indexes, denorms, and FK detection)
  3. Customize — merge overrides via spec/merge-specs
  4. Commit — write the final spec as an explicit literal in your code
  5. Apply — pass it to apply-spec! at runtime
```clojure
;; At the REPL during development:
(require '[com.latacora.export-to-sqlite.spec :as spec])

(def inferred
  (spec/infer-spec
   {:csvs ["stations.csv" "routes.csv"]
    :opts {:sample-size 500
           :header {:rename {"Station ID" "station_id"}}}}))

;; Review, then merge any overrides:
(def final-spec
  (spec/merge-specs
   inferred
   {:tables [{:name "stations"
              :columns [{:name "is_operational" :type :boolean
                         :boolean-true #{"1"} :boolean-false #{"0"}}
                        {:name "opened_on" :type :date}]}]}))
```

Use the inferred result as a guide — inspect it, then write your explicit spec by hand using its :columns, :indexes, :denorm, :primary-key, and :foreign-keys as a starting point. Don't try to serialize the inferred spec directly: it can contain function values (:name-fn, :transform) that aren't EDN-readable. The inferred spec is a development aid, not a serialization format.

While inference is generally fast (it samples, not scans), the ingestion hot path (apply-spec!) never calls inference functions. Keeping inference out of your production code path ensures stable, predictable behavior.

Schema Hints

CSV rarely carries enough information to reliably distinguish types, e.g. booleans from integers. Specs can be inferred and then refined via spec/merge-specs. Column overrides can include:

  • :source to map a column to a different CSV header name
  • :transform to apply a function before coercion
  • :boolean-true/:boolean-false to override boolean vocabularies
  • :primary-key to support denorm inference (split-csv)

Header Normalization

By default, all CSV headers are normalized to snake_case using camel-snake-kebab. For example, firstName becomes first_name and SERIALNO becomes serialno.
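Since normalization comes from camel-snake-kebab, you can preview at the REPL how a raw header will be transformed. This calls camel-snake-kebab directly; whether the library uses exactly `->snake_case_string` internally is an implementation detail:

```clojure
(require '[camel-snake-kebab.core :as csk])

;; Preview the default snake_case normalization on raw headers:
(csk/->snake_case_string "firstName") ;; => "first_name"
(csk/->snake_case_string "SERIALNO")  ;; => "serialno"
```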

Rename keys match raw CSV headers (before name-fn is applied), so you can write:

```clojure
{:header {:rename {"SERIALNO" "pums_household_id"}}}
```

To disable the default normalization (use headers verbatim):

```clojure
{:header {:name-fn nil}}
```

To supply a custom normalization function:

```clojure
{:header {:name-fn clojure.string/lower-case
          :rename  {"Station ID" "station_id"}}}
```

Index Suggestions

During inference, infer-spec proposes indexes based on column names and sampled data. The default heuristics work as follows:

  1. Named patterns — columns matching _id/uuid/guid/code (exact-pattern), _type/_status (categorical-pattern), or name/title/email/slug (nocase-pattern) are always indexed.
  2. Cardinality catch-all — remaining columns with >= 10% cardinality and >= 10 distinct values (up to :max-distinct of 10,000) are indexed, unless their name matches the :exclude-pattern (prose/content fields like description, bio, body, summary, etc.).
  3. Foreign-key indexes — FK columns detected by infer-spec are indexed via a separate code path and bypass all heuristic filtering.

Precedence: :exclude-columns > :include-columns > named patterns > cardinality (gated by :exclude-pattern + :max-distinct).
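A rough sketch of the step-2 cardinality check, with a hypothetical helper name and simplified logic (the library's actual implementation may differ):

```clojure
;; Hypothetical sketch of the step-2 cardinality catch-all;
;; not the library's actual implementation.
(defn indexable-by-cardinality?
  [col-name sampled-values {:keys [exclude-pattern max-distinct]
                            :or   {max-distinct 10000}}]
  (let [n          (count sampled-values)
        distinct-n (count (distinct sampled-values))]
    (and (or (nil? exclude-pattern)
             (not (re-find exclude-pattern col-name)))
         (>= distinct-n 10)                    ;; at least 10 distinct values
         (<= distinct-n max-distinct)          ;; :max-distinct cap
         (>= (/ distinct-n (max n 1)) 1/10)))) ;; >= 10% cardinality

;; 100 sampled values, all distinct -> indexed:
(indexable-by-cardinality? "region_code" (map str (range 100)) {})
;; => true
```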

Tune inference at the REPL via the :indexing key in opts:

```clojure
;; At the REPL:
(spec/infer-spec
 {:csvs ["data.csv"]
  :opts {:indexing {:exclude-columns #{"raw_html" "debug_blob"}
                    :include-columns #{"region_code"}
                    :max-distinct 5000}}})
```

To disable the default exclude pattern entirely (index prose/content columns via cardinality):

```clojure
{:indexing {:exclude-pattern nil}}
```

To replace it with a custom pattern:

```clojure
(require '[com.latacora.export-to-sqlite.schema :as schema])

{:indexing {:exclude-pattern (schema/name-pattern {:suffixes ["_blob" "_raw"]})}}
```

Once you're happy with the suggested indexes, copy them into your explicit spec as :indexes entries.

Column Renaming

When a column needs a different name in the database (e.g. a generic uuid should become investment_uuid), use spec/rename-column to keep :columns, [:header :rename], :primary-key, and :denorm in sync:

```clojure
(-> table
    (spec/rename-column "uuid" "investment_uuid")
    (spec/rename-column "status" "investment_status"))
```

Calls compose — each rename merges into the existing :header :rename map without clobbering prior entries. Renaming a column that doesn't exist in the spec throws.

Denormalization Strategies

Multi-valued columns (like tags, categories, or *_list) can be split into join tables. suggest-denorms evaluates each column in order:

  1. :include matches — always suggest (bypasses all other checks)
  2. Prose column or :exclude matches — skip. Prose names (description, bio, body, text, summary, content, notes, details, remarks, comments, instructions) are excluded automatically.
  3. Name-match-strength :none — skip (no multi-value signal from name)
  4. Multi-value rate below threshold — skip (tiered: 2% for _list/_ids, 10% for plural names, 20% default)
  5. Median segment length > :max-segment-length — skip (default 20 chars; filters out comma-laden prose)
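Steps 4 and 5 boil down to two sampled statistics, sketched here as hypothetical helpers (not the library's actual code):

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers for steps 4 and 5; not the library's actual code.
(defn multi-value-rate
  "Fraction of sampled values containing the delimiter (step 4)."
  [values delim]
  (if (empty? values)
    0
    (/ (count (filter #(str/includes? % delim) values))
       (count values))))

(defn median-segment-length
  "Median length of individual segments across sampled values (step 5)."
  [values delim]
  (let [re   (re-pattern (java.util.regex.Pattern/quote delim))
        lens (sort (map count (mapcat #(str/split % re) values)))]
    (when (seq lens)
      (nth lens (quot (count lens) 2)))))

(multi-value-rate ["red,blue" "green" "red,green,blue"] ",")      ;; => 2/3
(median-segment-length ["red,blue" "green" "red,green,blue"] ",") ;; => 4
```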

Tune inference at the REPL via the :denorm key in opts:

```clojure
;; At the REPL:
(spec/infer-spec
 {:csvs ["data.csv"]
  :opts {:denorm {:exclude #{"custom_field"}  ;; set or (fn [col-name] -> truthy)
                  :include #{"special_notes"} ;; force-include despite prose exclusion
                  :max-segment-length 30}}})  ;; or nil to disable
```

Once you're happy with the suggestions, copy the denorm entries into your explicit spec.

Splitting Strategies

The :split key controls how multi-valued columns are split, both for inference and for the SQL INSERT ... SELECT that populates join tables:

| `:split` value | Meaning |
| --- | --- |
| `","` (string) | Delimiter-based splitting (default). |
| `:json-array` | Values are JSON arrays. |
| `(fn [v] ...)` | Custom splitter — receives a string, returns a seq of segments. |
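As an example of the function form, a custom splitter for semicolon-separated values with stray whitespace might look like this (hypothetical, not part of the library):

```clojure
(require '[clojure.string :as str])

;; Hypothetical custom splitter: splits on semicolons, trims whitespace,
;; drops empty segments.
(def split-semicolons
  (fn [v]
    (->> (str/split v #";")
         (map str/trim)
         (remove str/blank?))))

(split-semicolons "red; blue ;; green") ;; => ("red" "blue" "green")

;; Plugged into a denorm spec:
{:denorm [{:source-column "tags"
           :table "station_tags"
           :value-column "tag"
           :key-columns ["station_id"]
           :split split-semicolons}]}
```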

Example with :split :json-array:

```clojure
{:denorm [{:source-column "tag_array"
           :table "item_tags"
           :value-column "tag"
           :key-columns ["item_id"]
           :split :json-array}]}
```

For columns that don't match the naming heuristic, provide explicit denorm specs via :denorm:

```clojure
{:denorm [{:source-column "tags"
           :table "station_tags"
           :value-column "tag"
           :key-columns ["station_id"]
           :split ","}]}
```

The legacy :delimiter key is still accepted and mapped to :split for backward compatibility.

Incremental Updates

apply-spec! automatically detects whether the database file already exists. For a new file it does a fast bulk insert (synchronous=OFF); for an existing file it uses UPSERT so only changed rows are written. The same call handles both cases — just call apply-spec! every time you have new data:

```clojure
(ets/apply-spec!
 {:db-path "production.sqlite"
  :spec {:tables [{:name "stations"
                   :path "stations.csv"
                   :primary-key ["station_id"]
                   :columns [{:name "station_id" :type :integer}
                             {:name "name" :type :text}]}]}})
```

How it works

Fresh build (file does not exist): rows are inserted directly into the main table with synchronous=OFF for maximum throughput.

Incremental update (file exists): each table is imported via a temp table:

  1. CSV rows are bulk-inserted into a CREATE TEMP TABLE (stored in memory/temp file, not in the main DB's WAL)
  2. INSERT ... ON CONFLICT DO UPDATE SET ... WHERE ... (UPSERT) syncs from temp into the main table — SQLite skips WAL writes for rows that haven't changed
  3. DELETE ... WHERE NOT EXISTS removes rows no longer in the source
  4. The temp table is dropped
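Concretely, steps 2 and 3 correspond to SQL along these lines — a hypothetical sketch for the stations table from above; the temp-table name and change-detection guard are illustrative, not the library's exact output:

```sql
-- Step 2: UPSERT from the temp table; the WHERE guard skips unchanged
-- rows, so they produce no WAL writes.
INSERT INTO stations (station_id, name)
SELECT station_id, name FROM temp_stations WHERE true
ON CONFLICT (station_id) DO UPDATE SET name = excluded.name
WHERE stations.name IS NOT excluded.name;

-- Step 3: remove rows that are no longer present in the source CSV.
DELETE FROM stations
WHERE NOT EXISTS
  (SELECT 1 FROM temp_stations t WHERE t.station_id = stations.station_id);
```

(The `WHERE true` on the SELECT is SQLite's standard workaround for a parsing ambiguity between join syntax and the `ON CONFLICT` clause.)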

Tables without primary keys are fully replaced (DELETE + INSERT).

Litestream replication

This is designed for use with litestream >= 0.5.x, which replicates SQLite WAL changes to S3 (or any supported replica backend). Since UPSERT only writes WAL frames for actually-changed rows, litestream ships minimal deltas:

| Scenario | WAL size (5-row DB) | WAL size (5000-row DB) |
| --- | --- | --- |
| No changes | ~8 KB | ~25 KB |
| 1% of rows changed | ~8 KB | ~25 KB |
| 10% of rows changed | ~8 KB | ~37 KB |
| 100% of rows changed | ~17 KB | ~210 KB |

WAL size scales with changed rows, not total database size. Give every table a primary key for minimal WAL output.

Project Tasks

  • bb lint runs clj-kondo
  • bb test runs Kaocha tests
  • bb maint checks for outdated deps via antq

License

EPL-2.0
