
I have a task to read a CSV file line by line and insert the rows into a database.

The CSV file contains about 1.7 million lines.

I use Python with the SQLAlchemy ORM (the merge function) to do this, but it takes over five hours.

Is this caused by Python's slow performance, by SQLAlchemy, or by something else?

Or would using Go give obviously better performance? (I have no experience with Go. Besides, this job needs to be scheduled every month.)

Hope you guys can give me some suggestions, thanks!

Update: the database is MySQL.

  • What database? Please tag your question. Commented Mar 22, 2016 at 6:10
  • OK, I have updated my tags and content. Commented Mar 22, 2016 at 6:15
  • You want LOAD DATA [LOCAL] INFILE. I can't help you with it but there are other questions on it and you can search around for official MySQL or other help on the Web. Commented Mar 22, 2016 at 6:25
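The LOAD DATA [LOCAL] INFILE suggestion from the last comment amounts to a single statement that the MySQL server bulk-loads itself. A minimal sketch of building that statement (the helper name, file path, and table name here are placeholders of mine, not from the comment; you would pass the result to your MySQL driver's cursor):

```python
def load_data_sql(csv_path, table):
    # Build a MySQL LOAD DATA LOCAL INFILE statement for a
    # comma-separated file whose first line is a header row.
    return (
        "LOAD DATA LOCAL INFILE '{0}' "
        "INTO TABLE {1} "
        "FIELDS TERMINATED BY ',' "
        "LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES"
    ).format(csv_path, table)

print(load_data_sql('/tmp/data.csv', 'my_table'))
```

Note that the server (and client) must have local_infile enabled for the LOCAL form to work.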

2 Answers


For such a task you don't want to insert data line by line :) Basically, you have two options:

  1. Ensure that SQLAlchemy does not run queries one by one. Use a batch INSERT query (How to do a batch insert in MySQL) instead.
  2. Preprocess your data as needed, write it out to a temporary CSV file, and then run LOAD DATA [LOCAL] INFILE as suggested above. If you don't need to preprocess your data, just feed the CSV to the database directly (I assume it's MySQL).
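Option 1 can be sketched as follows: read the CSV in chunks and insert each chunk in one executemany call instead of one round trip per row. The example uses the stdlib sqlite3 driver and made-up table/column names so it is self-contained and runnable; with MySQL the same pattern applies through mysql.connector's cursor.executemany or a SQLAlchemy Core insert given a list of rows.

```python
import csv
import io
import sqlite3

# In-memory database stands in for the real MySQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

# Stand-in for the real 1.7M-line CSV file.
csv_data = io.StringIO("id,name\n1,apple\n2,banana\n3,cherry\n")
reader = csv.reader(csv_data)
next(reader)  # skip the header row

BATCH_SIZE = 10000  # tune to your memory budget; one round trip per batch
batch = []
for row in reader:
    batch.append((int(row[0]), row[1]))
    if len(batch) >= BATCH_SIZE:
        conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", batch)
        batch.clear()
if batch:  # flush the final partial batch
    conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", batch)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # 3
```

Committing once per batch (or once at the end) rather than per row is a large part of the speedup.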

1 Comment

OK, I will try them!

Follow these three steps:

  1. Save the CSV file with the name of the table you want to load it into.
  2. Execute the Python script below to create the table dynamically (update the CSV filename and the DB parameters).
  3. Execute: mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv

PYTHON CODE:

    import numpy as np
    import pandas as pd
    from mysql.connector import connect

    csv_file = 'dbtable_name.csv'
    df = pd.read_csv(csv_file)
    table_name = csv_file.split('.')

    # Build a CREATE TABLE statement from the CSV's column names and dtypes
    query = "CREATE TABLE " + table_name[0] + " (\n"
    for count in np.arange(df.columns.values.size):
        query += df.columns.values[count]
        if df.dtypes[count] == 'int64':
            query += "\t\t int(11) NOT NULL"
        elif df.dtypes[count] == 'object':
            query += "\t\t varchar(64) NOT NULL"
        elif df.dtypes[count] == 'float64':
            query += "\t\t float(10,2) NOT NULL"
        if count == 0:
            query += " PRIMARY KEY"   # first column becomes the primary key
        if count < df.columns.values.size - 1:
            query += ",\n"
    query += " );"
    # print(query)

    database = connect(host='localhost',     # your host
                       user='username',      # username
                       password='password',  # password
                       database='dbname')    # database name
    curs = database.cursor(dictionary=True)
    curs.execute(query)
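The dtype-to-column mapping the script relies on can be isolated as a small helper (a sketch of mine, not part of the original answer; the function name is made up):

```python
def mysql_column(name, dtype, primary=False):
    # Map a pandas dtype name to the MySQL column type used in the
    # script above; pass primary=True for the first CSV column.
    types = {'int64': 'int(11)', 'object': 'varchar(64)', 'float64': 'float(10,2)'}
    col = "{0} {1} NOT NULL".format(name, types[dtype])
    if primary:
        col += " PRIMARY KEY"
    return col

print(mysql_column('id', 'int64', primary=True))
# id int(11) NOT NULL PRIMARY KEY
```

Be aware that varchar(64) and float(10,2) are fixed guesses; columns with longer strings or different precision will need manual adjustment.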
