
I'm using MySQL to work with a large log table (300 million records or so) with four columns (two varchars, an int, and a key), but my query is taking a very long time.

The goal is to dig through the log and find records that are taking a certain action at a high frequency.

I want the records with a status of 'D' or 'U' during events higher than an arbitrary eventID. I'm inserting them into a new table using a GROUP BY, and it's taking upwards of an entire day to run. Is there a way to do this faster?

INSERT INTO `tbl_FrequentActions` (`ActionCount`, `RecordNumber`)
SELECT COUNT(`idActionLog`) AS 'ActionCount', `RecordNumber`
FROM `ActionLog`
WHERE (`ActionStatus` LIKE 'D' OR `ActionStatus` LIKE 'U')
  AND `EventID` > 103
GROUP BY `RecordNumber`
HAVING COUNT(`idActionLog`) > 19;

Would it be faster to use temporary tables to run the WHERE conditions separately? That is, create a temporary table to cut everything down before running the GROUP BY?

All fields in the ActionLog are indexed.

EDIT: All the data is already in the log database in one table. It was mentioned that I was ambiguous on that point earlier.

The indexes are individual to the column.

EDIT2: Somebody asked whether my log file buffers are correctly configured for something of this size, and that's a great question, but I don't know. Yes, the table is in InnoDB format.
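(For reference, the relevant server settings can be inspected with the standard SHOW VARIABLES commands below; what values are appropriate depends on available RAM and workload.)

-- Inspect InnoDB buffer pool and redo log sizing
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_log_file_size';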

I built a test table of a couple million records and ran the query on it. It took 1 minute 30 seconds. I then broke the query down: a temporary table handles the entire WHERE clause, and the GROUP BY query runs against that temporary table. That knocked the time down to under a minute, which suggests a savings of several hours at full scale.
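Roughly, the breakdown looked like this (a sketch; the temporary table name is illustrative):

-- Step 1: apply the WHERE clause into a temporary table
CREATE TEMPORARY TABLE `tmp_FilteredActions` AS
SELECT `idActionLog`, `RecordNumber`
FROM `ActionLog`
WHERE (`ActionStatus` LIKE 'D' OR `ActionStatus` LIKE 'U')
  AND `EventID` > 103;

-- Step 2: run the GROUP BY against the much smaller temp table
INSERT INTO `tbl_FrequentActions` (`ActionCount`, `RecordNumber`)
SELECT COUNT(`idActionLog`), `RecordNumber`
FROM `tmp_FilteredActions`
GROUP BY `RecordNumber`
HAVING COUNT(`idActionLog`) > 19;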

EDIT3: Can I use ON DUPLICATE KEY UPDATE to make this faster? I tried the query below, but it just ran forever. I think it's producing a Cartesian product. Do I need to alias the tables somehow?

INSERT INTO `tbl_FrequentActions` (`ActionCount`, `RecordNumber`)
SELECT '1' AS 'ActionCount', `RecordNumber`
FROM `ActionLog`
WHERE (`Status` LIKE 'D' OR `Status` LIKE 'U')
  AND `EventID` > 103
ON DUPLICATE KEY UPDATE `DeliveryCount` = (`DeliveryCount` + 1);
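For reference, a working version of this pattern would need a UNIQUE key on `RecordNumber` in `tbl_FrequentActions`, and the column being incremented should be the same one being inserted (using `ActionCount` here is my assumption for that column). A sketch under those assumptions:

INSERT INTO `tbl_FrequentActions` (`ActionCount`, `RecordNumber`)
SELECT 1, `RecordNumber`
FROM `ActionLog`
WHERE (`ActionStatus` = 'D' OR `ActionStatus` = 'U')
  AND `EventID` > 103
ON DUPLICATE KEY UPDATE `ActionCount` = `ActionCount` + 1;

Note that this still touches every matching log row one at a time, so it is unlikely to beat the filter-then-GROUP-BY approach.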
  • Your question title and preface suggest you are having issues getting the data into the database from a file, but the question content presents an issue with dealing with the data after import. Which is it? Commented Dec 11, 2018 at 0:01
  • Changing like to = might help. I'm not sure that MySQL will correctly optimise ActionStatus like 'D' to ActionStatus = 'D'. Commented Dec 11, 2018 at 0:40
  • How large is the table on disk? Are you using InnoDB? Is MySQL configured with enough buffers, an appropriate log file size, etc.? Commented Dec 11, 2018 at 2:46
  • How selective is the WHERE clause? Can you show the EXPLAIN output for the current plan? Commented Dec 11, 2018 at 12:42
  • "It took 1 minute 30" -- How many GB in that table? In the full table? How much RAM? What is the value of innodb_buffer_pool_size? (My point is that the timings for the smaller table may not extrapolate to the bigger table.) Commented Dec 11, 2018 at 23:25

1 Answer


This sounds like a 'standard' summary table for a Data Warehouse application. I'll state a couple of assumptions, then discuss how to do that. The resulting query may take an hour; it may take only minutes.

  • The ActionLog is huge, but it is only "added" to. You never UPDATE or DELETE data (except perhaps for aging out old data).
  • "an arbitrary eventID" is really something more regular, such as "start of some day".

To start with, you would need to summarize most of the 300M rows into the summary table(s). Then, on a daily (or hourly?) basis, new data is summarized -- this being a rather quick operation. Alternatively, IODKU (INSERT ... ON DUPLICATE KEY UPDATE) can be used; a sketch follows below. Before deciding which, we need to understand the frequency of inserting into ActionLog. (It is probably rapid.) Do the log entries come in batches? Or one at a time?
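Here is a rough sketch of what I mean; all the table and column names are made up, and it assumes each roll-up run covers one day's worth of new rows:

-- Hypothetical summary table, one row per record per day
CREATE TABLE `ActionSummary` (
  `RecordNumber` INT NOT NULL,
  `EventDate` DATE NOT NULL,
  `ActionCount` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`RecordNumber`, `EventDate`)
);

-- Periodic roll-up of newly arrived rows; @last_id marks the
-- highest EventID already summarized.
INSERT INTO `ActionSummary` (`RecordNumber`, `EventDate`, `ActionCount`)
SELECT `RecordNumber`, CURDATE(), COUNT(*)
FROM `ActionLog`
WHERE `ActionStatus` IN ('D','U') AND `EventID` > @last_id
GROUP BY `RecordNumber`
ON DUPLICATE KEY UPDATE `ActionCount` = `ActionCount` + VALUES(`ActionCount`);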

Then the 'report' query would be performed against the Summary table, and run lots faster than running against the 'Fact' table (ActionLog).
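For example, against the sketched summary table above, the report reduces to something like:

SELECT `RecordNumber`, SUM(`ActionCount`) AS `ActionCount`
FROM `ActionSummary`
WHERE `EventDate` >= '2018-04-01'
GROUP BY `RecordNumber`
HAVING SUM(`ActionCount`) > 19;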

Typical summary tables work off EventDate >= '2018-04-01' instead of EventID > 103. So, I need some help in understanding where "103" comes from.

How many different values are there for Status? We need to decide between having multiple rows and having multiple columns.
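To show that choice concretely (both layouts hypothetical):

-- Option 1: multiple rows -- Status is part of the key
CREATE TABLE `ActionSummaryRows` (
  `RecordNumber` INT NOT NULL,
  `EventDate` DATE NOT NULL,
  `Status` CHAR(1) NOT NULL,
  `ActionCount` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`RecordNumber`, `EventDate`, `Status`)
);

-- Option 2: multiple columns -- one counter per status value
CREATE TABLE `ActionSummaryCols` (
  `RecordNumber` INT NOT NULL,
  `EventDate` DATE NOT NULL,
  `DCount` INT UNSIGNED NOT NULL DEFAULT 0,
  `UCount` INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (`RecordNumber`, `EventDate`)
);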

For further insight into where I am headed: Summary Tables and High speed ingestion


1 Comment

Hey, thanks for this. This was helpful, especially the links you provided. You're describing my application exactly. The EventID is used instead of a date because I was trying to save a column. The precursor query does a SELECT MIN(EventID) FROM EventLog WHERE EventDate = 'XYZ'. As for the Status field, it has three values, and 50% of them are 'D'.
