A Clojure library for turning data dumps (usually CSV) into SQLite databases. It is intentionally dataset-agnostic: you supply schema hints and overrides, and the library handles inference, table creation, and loading.
- Accept CSV (and similar tabular dumps) and load them into SQLite.
- Provide inspectable schema inference that can be overridden by humans or LLMs.
- Make “mostly correct” imports easy while keeping the inevitable ambiguity visible.
```clojure
(require '[com.latacora.export-to-sqlite.core :as ets])

(ets/apply-spec!
 {:db-path "export.sqlite"
  :spec
  {:tables
   [{:name "stations"
     :path "stations.csv"
     :primary-key ["station_id"]
     :columns [{:name "station_id" :type :integer}
               {:name "name" :type :text}
               {:name "is_operational" :type :boolean
                :boolean-true #{"1"} :boolean-false #{"0"}}
               {:name "opened_on" :type :date}
               {:name "tags" :type :text}]
     :indexes [{:columns ["station_id"]}
               {:columns ["name"] :collations {"name" "NOCASE"}}]
     :denorm [{:source-column "tags"
               :table "station_tags"
               :value-column "tag"
               :key-columns ["station_id"]
               :split ","}]}
    {:name "routes" :path "routes.csv"}]}})
```

Specs are plain data — easy to review, version, and share. Use Spec Inference to generate a starting point during development, then check the result into your code.
`spec/infer-spec` uses heuristics (column names, sampled data, cardinality) to propose types, indexes, primary keys, foreign keys, and denormalization specs. It is a development-time tool — run it once to generate a starting point, review and adjust the output, then make the spec explicit in your code. The heuristics are not guaranteed stable across library versions, so code that calls `infer-spec` at runtime may see different results after a library upgrade.
Recommended workflow:
- Infer — call `infer-spec` at the REPL or in a scratch namespace
- Review — inspect the result (especially indexes, denorms, and FK detection)
- Customize — merge overrides via `spec/merge-specs`
- Commit — write the final spec as an explicit literal in your code
- Apply — pass it to `apply-spec!` at runtime
```clojure
;; At the REPL during development:
(require '[com.latacora.export-to-sqlite.spec :as spec])

(def inferred
  (spec/infer-spec
   {:csvs ["stations.csv" "routes.csv"]
    :opts {:sample-size 500
           :header {:rename {"Station ID" "station_id"}}}}))

;; Review, then merge any overrides:
(def final-spec
  (spec/merge-specs
   inferred
   {:tables [{:name "stations"
              :columns [{:name "is_operational" :type :boolean
                         :boolean-true #{"1"} :boolean-false #{"0"}}
                        {:name "opened_on" :type :date}]}]}))
```

Use the inferred result as a guide — inspect it, then write your explicit spec by hand using its `:columns`, `:indexes`, `:denorm`, `:primary-key`, and `:foreign-keys` as a starting point. Don't try to serialize the inferred spec directly: it can contain function values (`:name-fn`, `:transform`) that aren't EDN-readable. The inferred spec is a development aid, not a serialization format.
While inference is generally fast (it samples rather than scanning whole files), the ingestion hot path (`apply-spec!`) never calls inference functions. Keeping inference out of your production code path ensures stable, predictable behavior.
CSV rarely carries enough information to reliably distinguish types like booleans vs integers. Specs can be inferred and then refined via `spec/merge-specs`. Column overrides can include:

- `:source` to map a column to a different CSV header name
- `:transform` to apply a function before coercion
- `:boolean-true` / `:boolean-false` to override boolean vocabularies
- `:primary-key` to support denorm inference (split-csv)
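As an illustration, these override keys can be combined in a single column spec. This is a sketch using the `stations` table from the quick-start example; the specific values chosen here are hypothetical:

```clojure
;; Hypothetical column overrides for a "stations" table,
;; combining the override keys described above.
(def overrides
  {:tables
   [{:name "stations"
     :columns
     [;; :source reads this column from a differently named CSV header
      {:name "station_id" :source "Station ID" :type :integer}
      ;; :transform is applied to the raw string before type coercion
      {:name "name" :type :text :transform clojure.string/trim}
      ;; :boolean-true / :boolean-false define a custom boolean vocabulary
      {:name "is_operational" :type :boolean
       :boolean-true #{"yes" "1"} :boolean-false #{"no" "0"}}]}]})
```

A map like this is what you would pass as the second argument to `spec/merge-specs`.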
By default, all CSV headers are normalized to snake_case using camel-snake-kebab. For example, `firstName` becomes `first_name` and `SERIALNO` becomes `serialno`.
Rename keys match raw CSV headers (before `name-fn` is applied), so you can write:

```clojure
{:header {:rename {"SERIALNO" "pums_household_id"}}}
```

To disable the default normalization (use headers verbatim):

```clojure
{:header {:name-fn nil}}
```

To supply a custom normalization function:

```clojure
{:header {:name-fn clojure.string/lower-case
          :rename {"Station ID" "station_id"}}}
```

During inference, `infer-spec` proposes indexes based on column names and sampled data. The default heuristics work as follows:
- Named patterns — columns matching `_id`/`uuid`/`guid`/`code` (exact-pattern), `_type`/`_status` (categorical-pattern), or `name`/`title`/`email`/`slug` (nocase-pattern) are always indexed.
- Cardinality catch-all — remaining columns with >= 10% cardinality and >= 10 distinct values (up to `:max-distinct` of 10,000) are indexed, unless their name matches the `:exclude-pattern` (prose/content fields like description, bio, body, summary, etc.).
- Foreign-key indexes — FK columns detected by `infer-spec` are indexed via a separate code path and bypass all heuristic filtering.
Precedence: `:exclude-columns` > `:include-columns` > named patterns > cardinality (gated by `:exclude-pattern` + `:max-distinct`).
Tune inference at the REPL via the `:indexing` key in opts:

```clojure
;; At the REPL:
(spec/infer-spec
 {:csvs ["data.csv"]
  :opts {:indexing {:exclude-columns #{"raw_html" "debug_blob"}
                    :include-columns #{"region_code"}
                    :max-distinct 5000}}})
```

To disable the default exclude pattern entirely (index prose/content columns via cardinality):

```clojure
{:indexing {:exclude-pattern nil}}
```

To replace it with a custom pattern:

```clojure
(require '[com.latacora.export-to-sqlite.schema :as schema])

{:indexing {:exclude-pattern (schema/name-pattern {:suffixes ["_blob" "_raw"]})}}
```

Once you're happy with the suggested indexes, copy them into your explicit spec as `:indexes` entries.
When a column needs a different name in the database (e.g. a generic `uuid` should become `investment_uuid`), use `spec/rename-column` to keep `:columns`, `[:header :rename]`, `:primary-key`, and `:denorm` in sync:
```clojure
(-> table
    (spec/rename-column "uuid" "investment_uuid")
    (spec/rename-column "status" "investment_status"))
```

Calls compose — each rename merges into the existing `:header` `:rename` map without clobbering prior entries. Renaming a column that doesn't exist in the spec throws.
Multi-valued columns (like `tags`, `categories`, or `*_list`) can be split into join tables. `suggest-denorms` evaluates each column in order:
- `:include` matches — always suggest (bypasses all other checks)
- Prose column or `:exclude` matches — skip. Prose names (`description`, `bio`, `body`, `text`, `summary`, `content`, `notes`, `details`, `remarks`, `comments`, `instructions`) are excluded automatically.
- Name-match-strength `:none` — skip (no multi-value signal from name)
- Multi-value rate below threshold — skip (tiered: 2% for `_list`/`_ids`, 10% for plural names, 20% default)
- Median segment length > `:max-segment-length` — skip (default 20 chars; filters out comma-laden prose)
Tune inference at the REPL via the `:denorm` key in opts:

```clojure
;; At the REPL:
(spec/infer-spec
 {:csvs ["data.csv"]
  :opts {:denorm {:exclude #{"custom_field"}  ;; set or (fn [col-name] -> truthy)
                  :include #{"special_notes"} ;; force-include despite prose exclusion
                  :max-segment-length 30}}})  ;; or nil to disable
```

Once you're happy with the suggestions, copy the denorm entries into your explicit spec.
The `:split` key controls how multi-valued columns are split, both for inference and for the SQL `INSERT ... SELECT` that populates join tables:
| `:split` value | Meaning |
|---|---|
| `","` (string) | Delimiter-based splitting (default). |
| `:json-array` | Values are JSON arrays. |
| `(fn [v] ...)` | Custom splitter — receives a string, returns a seq of segments. |
Example with `:split :json-array`:

```clojure
{:denorm [{:source-column "tag_array"
           :table "item_tags"
           :value-column "tag"
           :key-columns ["item_id"]
           :split :json-array}]}
```

For columns that don't match the naming heuristic, provide explicit denorm specs via `:denorm`:

```clojure
{:denorm [{:source-column "tags"
           :table "station_tags"
           :value-column "tag"
           :key-columns ["station_id"]
           :split ","}]}
```

The legacy `:delimiter` key is still accepted and mapped to `:split` for backward compatibility.
`apply-spec!` automatically detects whether the database file already exists. For a new file it does a fast bulk insert (`synchronous=OFF`); for an existing file it uses UPSERT so only changed rows are written. The same call handles both cases — just call `apply-spec!` every time you have new data:

```clojure
(ets/apply-spec!
 {:db-path "production.sqlite"
  :spec {:tables [{:name "stations"
                   :path "stations.csv"
                   :primary-key ["station_id"]
                   :columns [{:name "station_id" :type :integer}
                             {:name "name" :type :text}]}]}})
```

Fresh build (file does not exist): rows are inserted directly into the main table with `synchronous=OFF` for maximum throughput.
Incremental update (file exists): each table is imported via a temp table:
- CSV rows are bulk-inserted into a `CREATE TEMP TABLE` (stored in memory/temp file, not in the main DB's WAL)
- `INSERT ... ON CONFLICT DO UPDATE SET ... WHERE ...` (UPSERT) syncs from temp into the main table — SQLite skips WAL writes for rows that haven't changed
- `DELETE ... WHERE NOT EXISTS` removes rows no longer in the source
- The temp table is dropped
Tables without primary keys are fully replaced (DELETE + INSERT).
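The temp-table sync can be sketched in SQL. This is illustrative only, reusing the `stations` example from above; the exact statements the library emits may differ:

```sql
-- 1. Stage incoming CSV rows outside the main database's WAL.
CREATE TEMP TABLE stations_staging (station_id INTEGER, name TEXT);
-- (bulk insert of CSV rows into stations_staging happens here)

-- 2. UPSERT into the main table. The WHERE clause on DO UPDATE makes
--    unchanged rows a no-op, so SQLite writes no WAL frames for them.
--    (The "WHERE true" avoids a SQLite parsing ambiguity with ON CONFLICT.)
INSERT INTO stations (station_id, name)
SELECT station_id, name FROM stations_staging WHERE true
ON CONFLICT (station_id) DO UPDATE SET name = excluded.name
WHERE stations.name IS NOT excluded.name;

-- 3. Remove rows that are no longer present in the source CSV.
DELETE FROM stations
WHERE NOT EXISTS (SELECT 1 FROM stations_staging s
                  WHERE s.station_id = stations.station_id);

-- 4. Drop the staging table.
DROP TABLE stations_staging;
```

Because step 2 compares old and new values (`IS NOT` also handles NULLs), a run with no source changes touches the WAL only minimally, which is what keeps the litestream deltas below small.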
This is designed for use with litestream >= 0.5.x, which replicates SQLite WAL changes to S3 (or any supported replica backend). Since UPSERT only writes WAL frames for actually-changed rows, litestream ships minimal deltas:
| Scenario | WAL size (5-row DB) | WAL size (5000-row DB) |
|---|---|---|
| No changes | ~8 KB | ~25 KB |
| 1% of rows changed | ~8 KB | ~25 KB |
| 10% of rows changed | ~8 KB | ~37 KB |
| 100% of rows changed | ~17 KB | ~210 KB |
WAL size scales with changed rows, not total database size. Give every table a primary key for minimal WAL output.
- `bb lint` runs clj-kondo
- `bb test` runs Kaocha tests
- `bb maint` checks for outdated deps via antq
EPL-2.0