AWS Aurora RDS – Learnings from Loading a Billion Rows into Aurora

Business Use Case:
We do online assessment and reporting for the K-12 education industry (http://emetric.net/). Our business needed to load historical test reports for K-3 through K-12 students for an entire state, covering more than 10 years, which ended up being close to a billion rows of historical test report data.

Product Selection:
We have been using SQL Server for the majority of our database needs, and we wanted to see whether Aurora would satisfy our business needs. The primary reason: SQL Server's licensing model is prohibitively expensive, and scaling out has been challenging.

The features we liked out of the box from Aurora are:
Elasticity – we can go up to an 8x instance with 244 GB of RAM
Lower cost
Simple licensing model, pay as we go
We ended up using Aurora Reserved Instances, since our demand doesn't change and our traffic is predictable and stable.

Data Processing:
Our source data is CSV files. We had several programs, and the data layout is different for each program. Initially we decided to create target tables in the Aurora instance and then use the mysqlimport tool to load the CSV files. This worked well, but as the number of files grew, we had to create and load more and more tables. It was getting unwieldy, so we decided to try a different approach.
The steps are :
Use Bash Scripts to Parse CSV and Fixed Length Data Files
Use Bash Scripts to Load into Staging Tables
Stored Procedures to Move Data from Staging Tables to Target Tables
Call the Stored Proc in multiple threads

LOAD DATA INFILE OPTION

The nice thing about the LOAD DATA INFILE option is that we can create just one staging table for each program, since the schema is the same within a program. This helped us trim our database down to very few tables. A minimal sketch of a single load follows.
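
Here is a minimal sketch of one such load, driven from bash with the mysql client. The connection variables, the assessments database, the staging table stg_program_a, and the file path are all illustrative assumptions, not our production names:

    # Load one program's CSV into its shared staging table.
    # All names here are hypothetical placeholders.
    mysql --local-infile=1 -h "$AURORA_HOST" -u "$DB_USER" -p"$DB_PASS" assessments <<'SQL'
    LOAD DATA LOCAL INFILE '/data/program_a/reports_2016.csv'
    INTO TABLE stg_program_a
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;
    SQL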

PARALLEL LOADING:

Loading 100+ GB of data serially takes a while; we tried that with the largest possible instance, and it took about 14 hours.
We started evaluating options for loading the data in parallel and wrote a simple bash script to run the loads concurrently, sketched below.
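
A sketch of that idea, under the same hypothetical names as above; it caps the number of concurrent LOAD DATA jobs so the instance is not overwhelmed:

    #!/bin/bash
    # Launch one LOAD DATA job per CSV file, at most MAX_JOBS at a time.
    MAX_JOBS=8
    for f in /data/program_a/*.csv; do
      # Wait until a background slot frees up.
      while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do sleep 5; done
      mysql --local-infile=1 -h "$AURORA_HOST" -u "$DB_USER" -p"$DB_PASS" assessments \
        -e "LOAD DATA LOCAL INFILE '$f' INTO TABLE stg_program_a
            FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;" &
    done
    wait   # block until every background load finishes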

HANDLING NULLS:

While loading the data, if the CSV file doesn't contain a value for a particular field, Aurora loads an empty value instead of NULL. For INT data types, it loads zero (0), which skews the data averages. Averages are an important metric for us, indicating the performance of a campus, district, and state. To overcome this problem, you can set the value while you insert the data using the NULLIF() function, as in the sketch below. We found this out the hard way.
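
A sketch of the same load with a NULLIF() guard; the column names (student_id, test_id, raw_score) are illustrative assumptions:

    # Map empty CSV fields to NULL instead of '' / 0.
    # Column names are hypothetical placeholders.
    mysql --local-infile=1 -h "$AURORA_HOST" -u "$DB_USER" -p"$DB_PASS" assessments <<'SQL'
    LOAD DATA LOCAL INFILE '/data/program_a/reports_2016.csv'
    INTO TABLE stg_program_a
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
    (student_id, test_id, @raw_score)
    SET raw_score = NULLIF(@raw_score, '');  -- empty string becomes NULL, not 0
    SQL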

Moving Data to Target Tables:

We wanted to move the data from the staging tables to the target tables. This is performed as a typical upsert (update if exists, else insert). Moving a billion rows sequentially is not even an option, so we wrote a stored procedure that moves the data from a staging table to a target table based on a startID and endID. We had an auto-increment ID column in the staging tables, which makes it easy to process slices of the data; a sketch follows.
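
A sketch of the slice-based upsert, under the same hypothetical schema. It assumes the target table has a unique key for the upsert to collide on (here student_id, test_id), which is our own illustrative choice, not necessarily the real schema:

    # Create the slice-based upsert procedure (names are hypothetical).
    mysql -h "$AURORA_HOST" -u "$DB_USER" -p"$DB_PASS" assessments <<'SQL'
    DELIMITER $$
    CREATE PROCEDURE move_staging_slice(IN startID BIGINT, IN endID BIGINT)
    BEGIN
      -- Upsert one slice: insert new rows, update rows that already exist.
      INSERT INTO target_reports (student_id, test_id, raw_score)
      SELECT student_id, test_id, raw_score
      FROM stg_program_a
      WHERE id BETWEEN startID AND endID
      ON DUPLICATE KEY UPDATE raw_score = VALUES(raw_score);
    END$$
    DELIMITER ;
    SQL

    # Call the procedure for several ID slices in parallel.
    for start in 1 1000001 2000001 3000001; do
      end=$((start + 999999))
      mysql -h "$AURORA_HOST" -u "$DB_USER" -p"$DB_PASS" assessments \
        -e "CALL move_staging_slice($start, $end);" &
    done
    wait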

Performance Improvement Configurations
We modified the configuration for innodb_autoinc_lock_mode:

innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)

In this lock mode, no “INSERT-like” statements use the table-level AUTO-INC lock, and multiple statements can execute at the same time. This is the fastest and most scalable lock mode, but it is not safe when using statement-based replication or recovery scenarios when SQL statements are replayed from the binary log.
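
On RDS/Aurora this is changed in a custom DB parameter group rather than in my.cnf. A sketch with the AWS CLI, where the parameter group name is a hypothetical placeholder; the parameter is static, so it takes effect after a reboot:

    # Override the lock mode in a custom DB parameter group.
    # "our-custom-aurora56" is a hypothetical group name.
    aws rds modify-db-parameter-group \
      --db-parameter-group-name our-custom-aurora56 \
      --parameters "ParameterName=innodb_autoinc_lock_mode,ParameterValue=2,ApplyMethod=pending-reboot"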

Parameter Group Configuration Overrides

Name                      Value  Allowed Values  Is Modifiable  Source  Description
innodb_autoinc_lock_mode  2      0-2             TRUE           user    The locking mode to use for generating auto-increment values

Parameter             OurCustomParameters  default.aurora5.6
slow_query_log        1                    engine-default
tx_isolation          READ-COMMITTED       engine-default
general_log           1                    engine-default
long_query_time       5                    engine-default
group_concat_max_len  20000000             engine-default
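
A quick way to confirm the overrides are live on the instance (a sketch using the mysql client, with the same hypothetical connection variables as above):

    # Check the running values of the overridden parameters.
    mysql -h "$AURORA_HOST" -u "$DB_USER" -p"$DB_PASS" \
      -e "SHOW VARIABLES WHERE Variable_name IN
          ('innodb_autoinc_lock_mode','tx_isolation','long_query_time','group_concat_max_len');"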

7 comments on "AWS Aurora RDS – Learnings from Loading a Billion Rows into Aurora"
  1. mike says:

    nice article. really helpful.

    After data loading, do you have a fast way to create indexes?

  2. Manuel says:

    Hi,
    You say your load time was 14 hours when done serially. Can you tell us how much your load time dropped with parallel loading?
    Thanks,
    Manuel

  3. Raju Venkataraman says:

    Hi Manuel,
    We were able to bring our load time down to 2-3 hours with the parallel load process. It improved the load time significantly.

  4. Yanet says:

    Hi Raju,
    Thank you for this write up.
    Could you tell us what reserved instance type you used for your Aurora DB? (e.g. db.r3.large, db.r3.xlarge, etc.)
    Thanks

  5. Raju Venkataraman says:

    Hello Yanet,
    We have been using db.r3.4xlarge for the past few years.

  6. Yanet says:

    Thank you Raju!
    Your post is very helpful.
    Yanet

  7. Abhishek Anand says:

    Hi Raju, Good article. Could you please share your stored proc – “Move_To_ODS_STORED_PROC”
    Thanks.
