
A quick overview of DB2’s INGEST

As DB2 DBAs or data integrators, we all face the challenge of loading large amounts of data into a DB2 database while keeping the tables available 24×7. Traditionally we use LOAD, or sometimes IMPORT, to load large amounts of data into tables. Both have data concurrency and availability issues. INSERT gives the highest availability, but it can be painfully slow when inserting millions of rows.

Brief Intro :

INGEST, introduced in DB2 10, is fast and lets us load data continuously while giving high availability and recoverability. It is a cross between LOAD and IMPORT, and it is multithreaded so it can process data in parallel. Unlike LOAD and IMPORT, INGEST can run continuously, reading a data stream through a pipe. This is great if you are loading live feeds from a social media stream or from your production stream. Another feature of INGEST is that it can read multiple files as its source. Apart from the traditional INSERT statement, it can also do UPDATE, MERGE and DELETE, which comes in handy if the data you are loading contains updates to existing rows.
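
As a rough sketch of the continuous case (the pipe name, table and columns below are hypothetical, just to illustrate the idea, not part of the examples later in this post):

-- create a named pipe on the server and point the feed process at it,
-- for example:  mkfifo /tmp/feed_pipe   and then   my_feed_process > /tmp/feed_pipe
ingest from pipe /tmp/feed_pipe format delimited
($id integer external
,$payload char(200)
)
insert into myschema.feed_events(id, payload)
values($id, $payload);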

How does it work ?

INGEST uses three types of threads :

1. Transporter : Transporter threads read from the input files or data stream pipes in parallel and send the data to the formatters. DB2 spawns one transporter thread for each input file.

2. Formatter: Formatter threads format and prepare the data to be sent to the flushers. They arrange the data so that INGEST can insert it into DB2 in parallel. The number of formatter threads is specified by the num_formatters configuration parameter.

3. Flusher: Flusher threads insert the data into the DB2 table. If you are using INGEST to load into a partitioned database, flusher threads are associated with each partition. There can be multiple flushers running to increase the throughput of the ingest. The number of flushers for each partition is specified by the num_flushers_per_partition configuration parameter.
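
Both of those are ingest configuration parameters that you set in the CLP session before running the INGEST command; a quick sketch (the values here are arbitrary examples, not recommendations):

-- tune the parallelism for the current CLP session
ingest set num_formatters 4;
ingest set num_flushers_per_partition 2;

-- check the current settings
ingest get num_formatters;
ingest get num_flushers_per_partition;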

Like I said, it is designed to run efficiently and give high availability.

Can it recover from a failure ?

Absolutely ! INGEST is capable of restarting from where it failed. Under the covers it uses INSERT statements to load the data. With INGEST you can specify a dump file to hold rows rejected due to formatting errors or invalid characters, an exception table to hold rows with constraint violations, and a warning count to stop once a threshold is met. You can also specify the commit interval in rows or by time, which comes in handy when loading data continuously from a stream. If the INGEST utility gets aborted for any reason (a power failure, or somebody accidentally killing the process), you can restart the INGEST with the RESTART option and DB2 will resume from the last commit point. If you do not want to restart it, you can issue RESTART TERMINATE and DB2 will clean up the failed INGEST. Unlike with the TERMINATE parameter of the LOAD command, data already committed by the failed INGEST command is not removed from the table.
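
The commit interval is controlled by the commit_count and commit_period ingest configuration parameters, again set in the CLP session before the INGEST command; a small sketch (the values are arbitrary examples):

-- commit after every 10,000 ingested rows
ingest set commit_count 10000;

-- or instead commit every 60 seconds, which suits a continuous stream
ingest set commit_period 60;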

What all format can it handle ?

The INGEST utility supports the following input data formats:

Delimited text
Positional text and binary
Columns in various orders and formats

How about performance ?

Obviously this is one of the most sacred topics for a DBA when choosing how to load a lot of data. LOAD always gives the best performance, but it does not give high availability. Even though INGEST issues INSERTs under the covers, I have noticed that when loading hundreds of thousands of rows, INGEST was able to perform about 40% faster than traditional INSERTs. The reason I compared it to INSERT is that INSERT is the only other option that gives me high availability.

How is it run ?

Assume we have a comma-delimited customer file as follows that needs to be loaded :

10001,"Edward","Williams","12349 West Ave St","Taos","NM","564-573-4839"
10002,"James","Miller","43348 Knight St","Dumont","NJ","201-873-4239"
10003,"Jessica","Robertson","489 River N Rd","Austin","TX","512-534-9859"
...
90000,"Kathy","Rodriguez","8293 Brooksheild Apt, # 201","Anahiem","CA","714-354-775

The customer table is :
CNumber     INT           --> primary key
FName       varchar(50)
LName       varchar(50)
Street      varchar(100)
City        varchar(50)
State       char(2)
Telephone   char(12)
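
For reference, a minimal DDL sketch matching the column list above (tablespace and other options omitted):

create table raju.customer (
  cnumber    int not null primary key,
  fname      varchar(50),
  lname      varchar(50),
  street     varchar(100),
  city       varchar(50),
  state      char(2),
  telephone  char(12)
);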

Let's see our options to INGEST the data:

This is the simplest form :
ingest from file customer.del format delimited insert into raju.customer;

If you want more control, or if the data is in a different order than the table columns, you can do this. Note that here I have flipped the firstname and lastname in the VALUES clause:

ingest from file customer.del format delimited
($cid integer external
,$firstname char(50)
,$lastname char(50)
,$streetname char(100)
,$cityname char(50)
,$statename char(2)
,$telephone char(12)
)
dumpfile customer_dump.del
exception table raju.customerexcept
warningcount 100
messages ingest_customer.msg
insert into raju.customer(cnumber,fname,lname,street,city,state,telephone)
values($cid,$lastname,$firstname,$streetname,$cityname,$statename,$telephone);

The exception table should be in place before you run this. You can create it with:
create table raju.customerexcept like raju.customer;
or run the same DDL you used for the customer table.

If your data contains updates to existing rows as well as new rows, you can use a MERGE statement in your INGEST :

ingest from file customer_monday.del, customer_tuesday.del format delimited
($cid integer external
,$firstname char(50)
,$lastname char(50)
,$streetname char(100)
,$cityname char(50)
,$statename char(2)
,$telephone char(12)
)
dumpfile customer_dump.del
warningcount 100
messages ingest4.msg
merge into raju.customer on (cnumber = $cid)
when not matched then
insert (cnumber,fname,lname,street,city,state,telephone)
values($cid,$firstname,$lastname,$streetname,$cityname,$statename,$telephone)
when matched then
update set (fname,lname,street,city,state,telephone) =
($firstname,$lastname,$streetname,$cityname,$statename,$telephone);

Note that in the above example I have two input files. I have also removed the exception table, since DB2 does not allow the exception table option together with the MERGE option. Don't ask me why.

If you have a list of rows that you want to delete, you can do that as well :

ingest from file customer1.del format delimited
($cid integer external
,$firstname char(50)
,$lastname char(50)
,$streetname char(100)
,$cityname char(50)
,$statename char(2)
,$telephone char(12)
)
delete from raju.customer
where cnumber = $cid;

How do we restart ?

You can restart a failed INGEST with the RESTART CONTINUE <job_id> option.

If you want the ability to easily restart a failed INGEST, use the RESTART NEW <job_id> option on your original INGEST command, like this :

ingest from file customer_monday.del, customer_tuesday.del format delimited
($cid integer external
,$firstname char(50)
,$lastname char(50)
,$streetname char(100)
,$cityname char(50)
,$statename char(2)
,$telephone char(12)
)
dumpfile customer_dump.del
warningcount 100
messages ingest4.msg
restart new 'customeringest1'
merge into raju.customer on (cnumber = $cid)
when not matched then
insert (cnumber,fname,lname,street,city,state,telephone)
values($cid,$firstname,$lastname,$streetname,$cityname,$statename,$telephone)
when matched then
update set (fname,lname,street,city,state,telephone) =
($firstname,$lastname,$streetname,$cityname,$statename,$telephone);

The RESTART NEW option lets you name the ingest job so that you can identify the command if it fails. Once you identify the failed ingest, you can restart it by issuing RESTART CONTINUE :

ingest from file customer_monday.del, customer_tuesday.del format delimited
($cid integer external
,$firstname char(50)
,$lastname char(50)
,$streetname char(100)
,$cityname char(50)
,$statename char(2)
,$telephone char(12)
)
dumpfile customer_dump.del
warningcount 100
messages ingest4.msg
restart continue 'customeringest1'
merge into raju.customer on (cnumber = $cid)
when not matched then
insert (cnumber,fname,lname,street,city,state,telephone)
values($cid,$firstname,$lastname,$streetname,$cityname,$statename,$telephone)
when matched then
update set (fname,lname,street,city,state,telephone) =
($firstname,$lastname,$streetname,$cityname,$statename,$telephone);

The RESTART CONTINUE option will enable us to restart INGEST from the last commit point.

If you want to terminate the utility and clean up after a previously failed run, you can use the RESTART TERMINATE <job_id> option, just like the RESTART CONTINUE option.
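
For the job named above, that cleanup run would be (as I understand it) the same INGEST command with RESTART TERMINATE in place of RESTART CONTINUE :

ingest from file customer_monday.del, customer_tuesday.del format delimited
($cid integer external
,$firstname char(50)
,$lastname char(50)
,$streetname char(100)
,$cityname char(50)
,$statename char(2)
,$telephone char(12)
)
dumpfile customer_dump.del
warningcount 100
messages ingest4.msg
restart terminate 'customeringest1'
merge into raju.customer on (cnumber = $cid)
when not matched then
insert (cnumber,fname,lname,street,city,state,telephone)
values($cid,$firstname,$lastname,$streetname,$cityname,$statename,$telephone)
when matched then
update set (fname,lname,street,city,state,telephone) =
($firstname,$lastname,$streetname,$cityname,$statename,$telephone);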

How can I monitor the utility ?

There are a few ways to watch what the INGEST utility is doing. IBM has provided a few options:

1. INGEST GET STATS
2. INGEST LIST, or
3. You can query TABLE(MON_GET_CONNECTION(NULL,NULL)) and look for the application_name DB2_INGEST
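
For example, from a second CLP session on the same machine (the job id 1 below is just a placeholder; take the real id from INGEST LIST):

-- list the currently running INGEST commands and their job ids
ingest list;

-- show progress counters for one ingest job
ingest get stats for 1;

-- or look for the ingest connections through the monitoring table function
select application_handle, application_name
  from table(mon_get_connection(null, null))
 where application_name like 'DB2_INGEST%';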

Conclusion

While INGEST has a lot of room for improvement, it is a great utility to have compared to the other options. It is easy to use and keeps the table available while inserting large amounts of data. I also like the fact that we can use it to delete rows for a list of values; I hate using a large IN clause in my delete statements, or having to load the keys into a temporary table first. There are a few limitations with INGEST you need to be aware of: it currently does not allow LOBs, XML or structured types, and it has a lot of restrictions on generated columns. I definitely recommend trying out the INGEST utility to see how you can benefit from it. After all, you paid for the product, so the least you can do is try it. Trust me, if you use it right, you will love it !