XTDB Kaggle

A small XTDB utility to download CSV datasets from Kaggle and turn them into XTDB transaction operations.

At the moment, it’s only got a transformer for one dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata. If you do implement transformers for others, please do submit them as PRs!

Setup

xtdb-kaggle is a REPL-based tool at the moment. To get set up:

  • Clone the repo

  • Get yourself a Kaggle API key file - create an account, head to your account settings and download an API key JSON file

  • Set a KAGGLE_KEY_FILE environment variable pointing to the key file

  • Start a REPL and connect to it in your usual way.
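Once the REPL is up, it can help to confirm that the JVM actually sees the environment variable. A minimal sketch: `key-file-status` is a hypothetical helper, not part of xtdb-kaggle, and it takes the path as an argument so it can be tried without a real key file.

```clojure
(require '[clojure.java.io :as io])

(defn key-file-status
  "Hypothetical helper (not part of xtdb-kaggle): reports whether `path`
  points at an existing file."
  [path]
  (cond
    (nil? path)                    :env-var-not-set
    (not (.isFile (io/file path))) :file-not-found
    :else                          :ok))

;; Check the variable that the setup steps ask for.
(key-file-status (System/getenv "KAGGLE_KEY_FILE"))
```

If this returns `:env-var-not-set`, the variable was probably exported after the REPL process started; restarting the REPL from a shell where it is set is the usual fix.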

Then, find yourself an interesting dataset on Kaggle.

You need to tell xtdb-kaggle which files you’d like to download, and then how to turn each file into XTDB operations - this is done using multimethods.
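For readers new to multimethods, the sketch below shows the general shape of this kind of dispatch. The names `example-file-names` and `example-row->ops-fn`, their dispatch functions, and the "acme"/"widgets" dataset are all assumptions made up for illustration; xtdb-kaggle's own definitions may differ.

```clojure
;; Sketch only: these multimethods and their dispatch functions are
;; assumptions, not xtdb-kaggle's actual definitions.
(defmulti example-file-names
  (fn [{:keys [owner-slug dataset-slug]}]
    [owner-slug dataset-slug]))

(defmulti example-row->ops-fn
  (fn [{:keys [owner-slug dataset-slug file-name]}]
    [owner-slug dataset-slug file-name]))

;; One method names the files of a (hypothetical) dataset...
(defmethod example-file-names ["acme" "widgets"] [_]
  #{"widgets.csv"})

;; ...and one method per file returns a function from a CSV row
;; (a map with string keys) to a vector of transaction operations.
(defmethod example-row->ops-fn ["acme" "widgets" "widgets.csv"] [_]
  (fn [{:strs [id name]}]
    [[:xtdb.api/put {:xt/id (keyword "acme.widget" id)
                     :acme.widget/name name}]]))

;; Look up the files, then the row function for each file, and apply it
;; to one sample row.
(let [dataset {:owner-slug "acme" :dataset-slug "widgets"}]
  (for [file-name (example-file-names dataset)
        :let [row->ops (example-row->ops-fn (assoc dataset :file-name file-name))]]
    (row->ops {"id" "1" "name" "Sprocket"})))
```

Dispatching on a vector of slugs means a new dataset can be supported by adding methods, without touching the code that downloads and streams the files.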

Using that movie dataset as an example - we have an :owner-slug of "tmdb", a :dataset-slug of "tmdb-movie-metadata", and two files: "tmdb_5000_movies.csv" and "tmdb_5000_credits.csv".

We define a dataset-file-names method to specify the files, and one csv-row->ops-fn method for each file:

(defmethod dataset-file-names ["tmdb" "tmdb-movie-metadata"] [_]
  #{"tmdb_5000_movies.csv" "tmdb_5000_credits.csv"})

(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_movies.csv"] [_]
  (fn [{:strs [id title runtime budget revenue keywords genres] :as row}]
    [[::xt/put {:xt/id (keyword (name 'tmdb.movie) id)
                :tmdb/type :movie
                :tmdb.movie/id (Long/parseLong id)
                :tmdb.movie/title title
                :tmdb.movie/budget (some-> budget Long/parseLong)
                :tmdb.movie/revenue (some-> revenue Long/parseLong)
                :tmdb.movie/keywords (->> (json/read-value keywords)
                                          (into #{} (map #(get % "name"))))
                :tmdb.movie/genres (->> (json/read-value genres)
                                        (into #{} (map #(get % "name"))))}]]))

(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_credits.csv"] [_]
  (fn [{:strs [movie_id cast] :as row}]
    (let [movie-id (Long/parseLong movie_id)]
      (->> (for [{cast-name "name", :strs [credit_id id character]} (json/read-value cast)]
             [[::xt/put {:xt/id (keyword (name 'tmdb.cast) (str id))
                         :tmdb/type :cast
                         :tmdb.cast/id id
                         :tmdb.cast/name cast-name}]
              [::xt/put {:xt/id (keyword (name 'tmdb.credit) credit_id)
                         :tmdb/type :credit
                         :tmdb.movie/id movie-id
                         :tmdb.cast/id id
                         :tmdb.cast/character character}]])
           (apply concat)))))
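Three idioms in these transformers are worth seeing in isolation: building a namespaced :xt/id from a CSV string, collecting the "name" of each entry in a JSON column into a set, and parsing an optional number with some->. The sketch below uses invented sample values and already-parsed data in place of json/read-value, so it runs without extra dependencies.

```clojure
;; Sample values are invented for illustration, not taken from the dataset.
(def sample-id "42")

;; Stand-in for (json/read-value genres): a sequence of maps with string keys.
(def parsed-genres
  [{"id" 1 "name" "Action"}
   {"id" 2 "name" "Adventure"}])

;; (name 'tmdb.movie) is "tmdb.movie", which becomes the keyword's namespace.
(def movie-xt-id (keyword (name 'tmdb.movie) sample-id))

;; Transducer form of "take each entry's name and collect into a set".
(def genre-names (into #{} (map #(get % "name")) parsed-genres))

;; some-> only calls Long/parseLong when the value is non-nil.
(def parsed-budget (some-> "1000" Long/parseLong))
(def missing-budget (some-> nil Long/parseLong))
```

Note that some-> guards against nil only; an empty string would still reach Long/parseLong and throw a NumberFormatException, so a dataset with blank numeric cells may need an extra check.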

Then, we can stream the dataset to a local file of XTDB transaction ops using:

(->> (dataset->ops {:owner-slug "tmdb", :dataset-slug "tmdb-movie-metadata"})
     (ops->stream (io/output-stream (io/file "/tmp/movies.edn"))))
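To use the ops later, the file has to be read back in. The sketch below assumes the file is a plain sequence of top-level EDN forms; the exact layout that ops->stream writes is not described here, so treat that as an assumption and check the file first. `read-edn-forms` is a hypothetical helper, not part of xtdb-kaggle.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '[java.io PushbackReader])

(defn read-edn-forms
  "Hypothetical helper: reads every top-level EDN form from `file` into a
  vector. Assumes the file is a plain sequence of EDN forms."
  [file]
  (with-open [rdr (PushbackReader. (io/reader file))]
    (let [eof (Object.)] ; unique sentinel returned at end of input
      (loop [forms []]
        (let [form (edn/read {:eof eof} rdr)]
          (if (identical? eof form)
            forms
            (recur (conj forms form))))))))

;; Usage, once the file from the step above exists:
;; (take 2 (read-edn-forms "/tmp/movies.edn"))
```

This reads the whole file into memory, which is fine for a dataset of this size. From there the forms could be grouped with partition-all and submitted to an XTDB node in batches.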

Have fun!
