
export-to-sqlite

A Clojure library for turning data dumps (usually CSV) into SQLite databases. It is intentionally dataset-agnostic: you supply schema hints and overrides, and the library handles inference, table creation, and loading.

Goals

  • Accept CSV (and similar tabular dumps) and load them into SQLite.
  • Provide inspectable schema inference that can be overridden by humans or LLMs.
  • Make “mostly correct” imports easy while keeping the inevitable ambiguity visible.

Quick Start

```clojure
(require '[com.latacora.export-to-sqlite.core :as ets])

(ets/apply-spec!
 {:db-path "export.sqlite"
  :spec
  {:tables
   [{:name "stations"
     :path "stations.csv"
     :primary-key ["station_id"]
     :columns [{:name "station_id" :type :integer}
               {:name "name" :type :text}
               {:name "is_operational" :type :boolean
                :boolean-true #{"1"} :boolean-false #{"0"}}
               {:name "opened_on" :type :date}
               {:name "tags" :type :text}]
     :indexes [{:columns ["station_id"]}
               {:columns ["name"] :collations {"name" "NOCASE"}}]
     :denorm [{:source-column "tags"
               :table "station_tags"
               :value-column "tag"
               :key-columns ["station_id"]
               :split ","}]}
    {:name "routes" :path "routes.csv"}]}})
```

Specs are plain data — easy to review, version, and share. Use Spec Inference to generate a starting point during development, then check the result into your code.

Spec Inference

spec/infer-spec uses heuristics (column names, sampled data, cardinality) to propose types, indexes, primary keys, foreign keys, and denormalization specs. It is a development-time tool — run it once to generate a starting point, review and adjust the output, then make the spec explicit in your code. The heuristics are not guaranteed stable across library versions, so code that calls infer-spec at runtime may see different results after a library upgrade.

Recommended workflow:

  1. Infer — call infer-spec at the REPL or in a scratch namespace
  2. Review — inspect the result (especially indexes, denorms, and FK detection)
  3. Customize — merge overrides via spec/merge-specs
  4. Commit — write the final spec as an explicit literal in your code
  5. Apply — pass it to apply-spec! at runtime
```clojure
;; At the REPL during development:
(require '[com.latacora.export-to-sqlite.spec :as spec])

(def inferred
  (spec/infer-spec
   {:csvs ["stations.csv" "routes.csv"]
    :opts {:sample-size 500
           :header {:rename {"Station ID" "station_id"}}}}))

;; Review, then merge any overrides:
(def final-spec
  (spec/merge-specs
   inferred
   {:tables [{:name "stations"
              :columns [{:name "is_operational" :type :boolean
                         :boolean-true #{"1"} :boolean-false #{"0"}}
                        {:name "opened_on" :type :date}]}]}))
```

Use the inferred result as a guide — inspect it, then write your explicit spec by hand using its :columns, :indexes, :denorm, :primary-key, and :foreign-keys as a starting point. Don't try to serialize the inferred spec directly: it can contain function values (:name-fn, :transform) that aren't EDN-readable. The inferred spec is a development aid, not a serialization format.

While inference is generally fast (it samples, not scans), the ingestion hot path (apply-spec!) never calls inference functions. Keeping inference out of your production code path ensures stable, predictable behavior.

Schema Hints

CSV rarely carries enough information to reliably distinguish types, e.g. booleans from integers. Specs can be inferred and then refined via spec/merge-specs. Column overrides can include:

  • :source to map a column to a different CSV header name
  • :transform to apply a function before coercion
  • :boolean-true/:boolean-false to override boolean vocabularies
  • :primary-key to support denorm inference (split-csv)

Header Normalization

By default, all CSV headers are normalized to snake_case using camel-snake-kebab. For example, firstName becomes first_name and SERIALNO becomes serialno.
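Since normalization comes from camel-snake-kebab, you can preview at the REPL how a raw header will be transformed. This calls camel-snake-kebab directly; whether the library uses exactly `->snake_case_string` internally is an implementation detail:

```clojure
(require '[camel-snake-kebab.core :as csk])

;; Preview the default snake_case normalization on raw headers:
(csk/->snake_case_string "firstName") ;; => "first_name"
(csk/->snake_case_string "SERIALNO")  ;; => "serialno"
```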

Rename keys match raw CSV headers (before name-fn is applied), so you can write:

```clojure
{:header {:rename {"SERIALNO" "pums_household_id"}}}
```

To disable the default normalization (use headers verbatim):

```clojure
{:header {:name-fn nil}}
```

To supply a custom normalization function:

```clojure
{:header {:name-fn clojure.string/lower-case
          :rename  {"Station ID" "station_id"}}}
```

Index Suggestions

During inference, infer-spec proposes indexes based on column names and sampled data. The default heuristics work as follows:

  1. Named patterns — columns matching _id/uuid/guid/code (exact-pattern), _type/_status (categorical-pattern), or name/title/email/slug (nocase-pattern) are always indexed.
  2. Cardinality catch-all — remaining columns with >= 10% cardinality and >= 10 distinct values (up to :max-distinct of 10,000) are indexed, unless their name matches the :exclude-pattern (prose/content fields like description, bio, body, summary, etc.).
  3. Foreign-key indexes — FK columns detected by infer-spec are indexed via a separate code path and bypass all heuristic filtering.

Precedence: :exclude-columns > :include-columns > named patterns > cardinality (gated by :exclude-pattern + :max-distinct).
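A rough sketch of the step-2 cardinality check, with a hypothetical helper name and simplified logic (the library's actual implementation may differ):

```clojure
;; Hypothetical sketch of the step-2 cardinality catch-all;
;; not the library's actual implementation.
(defn indexable-by-cardinality?
  [col-name sampled-values {:keys [exclude-pattern max-distinct]
                            :or   {max-distinct 10000}}]
  (let [n          (count sampled-values)
        distinct-n (count (distinct sampled-values))]
    (and (or (nil? exclude-pattern)
             (not (re-find exclude-pattern col-name)))
         (>= distinct-n 10)                    ;; at least 10 distinct values
         (<= distinct-n max-distinct)          ;; :max-distinct cap
         (>= (/ distinct-n (max n 1)) 1/10)))) ;; >= 10% cardinality

;; 100 sampled values, all distinct -> indexed:
(indexable-by-cardinality? "region_code" (map str (range 100)) {})
;; => true
```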

Tune inference at the REPL via the :indexing key in opts:

```clojure
;; At the REPL:
(spec/infer-spec
 {:csvs ["data.csv"]
  :opts {:indexing {:exclude-columns #{"raw_html" "debug_blob"}
                    :include-columns #{"region_code"}
                    :max-distinct 5000}}})
```

To disable the default exclude pattern entirely (index prose/content columns via cardinality):

```clojure
{:indexing {:exclude-pattern nil}}
```

To replace it with a custom pattern:

```clojure
(require '[com.latacora.export-to-sqlite.schema :as schema])

{:indexing {:exclude-pattern (schema/name-pattern {:suffixes ["_blob" "_raw"]})}}
```

Once you're happy with the suggested indexes, copy them into your explicit spec as :indexes entries.

Column Renaming

When a column needs a different name in the database (e.g. a generic uuid should become investment_uuid), use spec/rename-column to keep :columns, [:header :rename], :primary-key, and :denorm in sync:

```clojure
(-> table
    (spec/rename-column "uuid" "investment_uuid")
    (spec/rename-column "status" "investment_status"))
```

Calls compose — each rename merges into the existing :header :rename map without clobbering prior entries. Renaming a column that doesn't exist in the spec throws.

Denormalization Strategies

Multi-valued columns (like tags, categories, or *_list) can be split into join tables. suggest-denorms evaluates each column in order:

  1. :include matches — always suggest (bypasses all other checks)
  2. Prose column or :exclude matches — skip. Prose names (description, bio, body, text, summary, content, notes, details, remarks, comments, instructions) are excluded automatically.
  3. Name-match-strength :none — skip (no multi-value signal from name)
  4. Multi-value rate below threshold — skip (tiered: 2% for _list/_ids, 10% for plural names, 20% default)
  5. Median segment length > :max-segment-length — skip (default 20 chars; filters out comma-laden prose)
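Steps 4 and 5 boil down to two sampled statistics, sketched here as hypothetical helpers (not the library's actual code):

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers for steps 4 and 5; not the library's actual code.
(defn multi-value-rate
  "Fraction of sampled values containing the delimiter (step 4)."
  [values delim]
  (if (empty? values)
    0
    (/ (count (filter #(str/includes? % delim) values))
       (count values))))

(defn median-segment-length
  "Median length of individual segments across sampled values (step 5)."
  [values delim]
  (let [re   (re-pattern (java.util.regex.Pattern/quote delim))
        lens (sort (map count (mapcat #(str/split % re) values)))]
    (when (seq lens)
      (nth lens (quot (count lens) 2)))))

(multi-value-rate ["red,blue" "green" "red,green,blue"] ",")      ;; => 2/3
(median-segment-length ["red,blue" "green" "red,green,blue"] ",") ;; => 4
```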

Tune inference at the REPL via the :denorm key in opts:

```clojure
;; At the REPL:
(spec/infer-spec
 {:csvs ["data.csv"]
  :opts {:denorm {:exclude #{"custom_field"}  ;; set or (fn [col-name] -> truthy)
                  :include #{"special_notes"} ;; force-include despite prose exclusion
                  :max-segment-length 30}}})  ;; or nil to disable
```

Once you're happy with the suggestions, copy the denorm entries into your explicit spec.

Splitting Strategies

The :split key controls how multi-valued columns are split, both for inference and for the SQL INSERT ... SELECT that populates join tables:

| `:split` value | Meaning |
| --- | --- |
| `","` (string) | Delimiter-based splitting (default). |
| `:json-array` | Values are JSON arrays. |
| `(fn [v] ...)` | Custom splitter — receives a string, returns a seq of segments. |
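As an example of the function form, a custom splitter for semicolon-separated values with stray whitespace might look like this (hypothetical, not part of the library):

```clojure
(require '[clojure.string :as str])

;; Hypothetical custom splitter: splits on semicolons, trims whitespace,
;; drops empty segments.
(def split-semicolons
  (fn [v]
    (->> (str/split v #";")
         (map str/trim)
         (remove str/blank?))))

(split-semicolons "red; blue ;; green") ;; => ("red" "blue" "green")

;; Plugged into a denorm spec:
{:denorm [{:source-column "tags"
           :table "station_tags"
           :value-column "tag"
           :key-columns ["station_id"]
           :split split-semicolons}]}
```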

Example with :split :json-array:

```clojure
{:denorm [{:source-column "tag_array"
           :table "item_tags"
           :value-column "tag"
           :key-columns ["item_id"]
           :split :json-array}]}
```

For columns that don't match the naming heuristic, provide explicit denorm specs via :denorm:

```clojure
{:denorm [{:source-column "tags"
           :table "station_tags"
           :value-column "tag"
           :key-columns ["station_id"]
           :split ","}]}
```

The legacy :delimiter key is still accepted and mapped to :split for backward compatibility.

Incremental Updates

apply-spec! automatically detects whether the database file already exists. For a new file it does a fast bulk insert (synchronous=OFF); for an existing file it uses UPSERT so only changed rows are written. The same call handles both cases — just call apply-spec! every time you have new data:

```clojure
(ets/apply-spec!
 {:db-path "production.sqlite"
  :spec {:tables [{:name "stations"
                   :path "stations.csv"
                   :primary-key ["station_id"]
                   :columns [{:name "station_id" :type :integer}
                             {:name "name" :type :text}]}]}})
```

How it works

Fresh build (file does not exist): rows are inserted directly into the main table with synchronous=OFF for maximum throughput.

Incremental update (file exists): each table is imported via a temp table:

  1. CSV rows are bulk-inserted into a CREATE TEMP TABLE (stored in memory/temp file, not in the main DB's WAL)
  2. INSERT ... ON CONFLICT DO UPDATE SET ... WHERE ... (UPSERT) syncs from temp into the main table — SQLite skips WAL writes for rows that haven't changed
  3. DELETE ... WHERE NOT EXISTS removes rows no longer in the source
  4. The temp table is dropped
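Concretely, steps 2 and 3 correspond to SQL along these lines — a hypothetical sketch for the stations table from above; the temp-table name and change-detection guard are illustrative, not the library's exact output:

```sql
-- Step 2: UPSERT from the temp table; the WHERE guard skips unchanged
-- rows, so they produce no WAL writes.
INSERT INTO stations (station_id, name)
SELECT station_id, name FROM temp_stations WHERE true
ON CONFLICT (station_id) DO UPDATE SET name = excluded.name
WHERE stations.name IS NOT excluded.name;

-- Step 3: remove rows that are no longer present in the source CSV.
DELETE FROM stations
WHERE NOT EXISTS
  (SELECT 1 FROM temp_stations t WHERE t.station_id = stations.station_id);
```

(The `WHERE true` on the SELECT is SQLite's standard workaround for a parsing ambiguity between join syntax and the `ON CONFLICT` clause.)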

Tables without primary keys are fully replaced (DELETE + INSERT).

Litestream replication

This is designed for use with litestream >= 0.5.x, which replicates SQLite WAL changes to S3 (or any supported replica backend). Since UPSERT only writes WAL frames for actually-changed rows, litestream ships minimal deltas:

| Scenario | WAL size (5-row DB) | WAL size (5000-row DB) |
| --- | --- | --- |
| No changes | ~8 KB | ~25 KB |
| 1% of rows changed | ~8 KB | ~25 KB |
| 10% of rows changed | ~8 KB | ~37 KB |
| 100% of rows changed | ~17 KB | ~210 KB |

WAL size scales with changed rows, not total database size. Give every table a primary key for minimal WAL output.

Project Tasks

  • bb lint runs clj-kondo
  • bb test runs Kaocha tests
  • bb maint checks for outdated deps via antq

License

EPL-2.0
