
I have a task to read a CSV file line by line and insert the rows into a database.

The CSV file contains about 1.7 million lines.

I use Python with the SQLAlchemy ORM (the merge function) to do this, but it takes over five hours.

Is this caused by Python's slow performance, by SQLAlchemy, or by something else?

Or would using Go give obviously better performance? (I have no experience with Go. Besides, this job needs to be scheduled every month.)

Hope you guys can give me some suggestions, thanks!

Update: the database is MySQL.

  • What database? Please tag your question. Commented Mar 22, 2016 at 6:10
  • OK, I have updated my tags and content. Commented Mar 22, 2016 at 6:15
  • You want LOAD DATA [LOCAL] INFILE. I can't help you with it but there are other questions on it and you can search around for official MySQL or other help on the Web. Commented Mar 22, 2016 at 6:25
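The LOAD DATA [LOCAL] INFILE suggestion from the last comment amounts to a single statement that the MySQL server bulk-loads itself. A minimal sketch of building that statement (the helper name, file path, and table name here are placeholders of mine, not from the comment; you would pass the result to your MySQL driver's cursor):

```python
def load_data_sql(csv_path, table):
    # Build a MySQL LOAD DATA LOCAL INFILE statement for a
    # comma-separated file whose first line is a header row.
    return (
        "LOAD DATA LOCAL INFILE '{0}' "
        "INTO TABLE {1} "
        "FIELDS TERMINATED BY ',' "
        "LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES"
    ).format(csv_path, table)

print(load_data_sql('/tmp/data.csv', 'my_table'))
```

Note that the server (and client) must have local_infile enabled for the LOCAL form to work.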

2 Answers


For such a task you don't want to insert data line by line :) Basically, you have two options:

  1. Ensure that SQLAlchemy does not run queries one by one. Use a batch INSERT query (How to do a batch insert in MySQL) instead.
  2. Preprocess your data as needed, write it out to a temporary CSV file, and then run LOAD DATA [LOCAL] INFILE as suggested above. If you don't need to preprocess your data, just feed the CSV to the database directly (I assume it's MySQL).
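Option 1 can be sketched as follows: read the CSV in chunks and insert each chunk in one executemany call instead of one round trip per row. The example uses the stdlib sqlite3 driver and made-up table/column names so it is self-contained and runnable; with MySQL the same pattern applies through mysql.connector's cursor.executemany or a SQLAlchemy Core insert given a list of rows.

```python
import csv
import io
import sqlite3

# In-memory database stands in for the real MySQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

# Stand-in for the real 1.7M-line CSV file.
csv_data = io.StringIO("id,name\n1,apple\n2,banana\n3,cherry\n")
reader = csv.reader(csv_data)
next(reader)  # skip the header row

BATCH_SIZE = 10000  # tune to your memory budget; one round trip per batch
batch = []
for row in reader:
    batch.append((int(row[0]), row[1]))
    if len(batch) >= BATCH_SIZE:
        conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", batch)
        batch.clear()
if batch:  # flush the final partial batch
    conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", batch)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # 3
```

Committing once per batch (or once at the end) rather than per row is a large part of the speedup.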

1 Comment

OK, I will try them!

Follow these three steps:

  1. Save the CSV file with the name of the table you want to load it into.
  2. Execute the Python script below to create the table dynamically (update the CSV filename and the DB parameters).
  3. Execute: mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv

PYTHON CODE:

    import numpy as np
    import pandas as pd
    from mysql.connector import connect

    csv_file = 'dbtable_name.csv'
    df = pd.read_csv(csv_file)
    table_name = csv_file.split('.')

    # Build a CREATE TABLE statement from the CSV's column names and dtypes
    query = "CREATE TABLE " + table_name[0] + " (\n"
    for count in np.arange(df.columns.values.size):
        query += df.columns.values[count]
        if df.dtypes[count] == 'int64':
            query += "\t\t int(11) NOT NULL"
        elif df.dtypes[count] == 'object':
            query += "\t\t varchar(64) NOT NULL"
        elif df.dtypes[count] == 'float64':
            query += "\t\t float(10,2) NOT NULL"
        if count == 0:
            query += " PRIMARY KEY"   # first column becomes the primary key
        if count < df.columns.values.size - 1:
            query += ",\n"
    query += " );"
    # print(query)

    database = connect(host='localhost',     # your host
                       user='username',      # username
                       password='password',  # password
                       database='dbname')    # database name
    curs = database.cursor(dictionary=True)
    curs.execute(query)
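The dtype-to-column mapping the script relies on can be isolated as a small helper (a sketch of mine, not part of the original answer; the function name is made up):

```python
def mysql_column(name, dtype, primary=False):
    # Map a pandas dtype name to the MySQL column type used in the
    # script above; pass primary=True for the first CSV column.
    types = {'int64': 'int(11)', 'object': 'varchar(64)', 'float64': 'float(10,2)'}
    col = "{0} {1} NOT NULL".format(name, types[dtype])
    if primary:
        col += " PRIMARY KEY"
    return col

print(mysql_column('id', 'int64', primary=True))
# id int(11) NOT NULL PRIMARY KEY
```

Be aware that varchar(64) and float(10,2) are fixed guesses; columns with longer strings or different precision will need manual adjustment.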
