cpimport is a high-speed bulk load utility that imports data into ColumnStore tables in a fast and efficient manner. It accepts as input any flat file in which a delimiter separates the fields of data (i.e. the columns of a table). The default delimiter is the pipe ('|') character, but other delimiters such as commas may be used as well. The data values must be in the same order as the create table statement, i.e. column 1 matches the first column in the table and so on. Date values must be specified in the format 'yyyy-mm-dd'.
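For example, a single input row for a hypothetical four-column table (an integer id, a name, a quantity, and a date) might look like this with the default pipe delimiter (the values are purely illustrative):

101|Acme Widget|42|2023-06-01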
cpimport performs the following operations when importing data into a MariaDB ColumnStore database:
It is important to note that:
There are two primary steps to using the cpimport utility:
The simplest form of the cpimport command is:
cpimport dbName tblName [loadFile]
The full syntax is like this:
cpimport dbName tblName [loadFile]
  [-h] [-m mode] [-f filepath] [-d DebugLevel]
  [-c readBufferSize] [-b numBuffers] [-r numReaders]
  [-e maxErrors] [-B libBufferSize] [-s colDelimiter] [-E EnclosedByChar]
  [-C escChar] [-j jobID] [-p jobFilePath] [-w numParsers]
  [-n nullOption] [-P pmList] [-i] [-S] [-q batchQty]

positional parameters:
  dbName    Name of the database to load
  tblName   Name of table to load
  loadFile  Optional input file name in current directory,
            unless a fully qualified name is given.
            If not given, input read from STDIN.

Options:
  -b  Number of read buffers
  -c  Application read buffer size (in bytes)
  -d  Print debug messages at the given level (1-3)
  -e  Max number of allowable errors per table per PM
  -f  Data file directory path. Default is current working directory.
      In Mode 1, -f represents the local input file path.
      In Mode 2, -f represents the PM based input file path.
      In Mode 3, -f represents the local input file path.
  -l  Name of import file to be loaded, relative to -f path.
      (Cannot be used with -p)
  -h  Print this message.
  -q  Batch quantity, number of rows distributed per batch in Mode 1
  -i  Print extended info to console in Mode 3.
  -j  Job ID. In simple usage, default is the table OID
      unless a fully qualified input file name is given.
  -n  NullOption (0 - treat the string NULL as data (default);
      1 - treat the string NULL as a NULL value)
  -p  Path for XML job description file.
  -r  Number of readers.
  -s  'c' is the delimiter between column values.
  -B  I/O library read buffer size (in bytes)
  -w  Number of parsers.
  -E  Enclosed by character if field values are enclosed.
  -C  Escape character used in conjunction with 'enclosed by' character,
      or as part of NULL escape sequence ('\N'); default is '\'
  -I  Import binary data; how to treat NULL values:
      1 - import NULL values
      2 - saturate NULL values
  -P  List of PMs ex: -P 1,2,3. Default is all PMs.
  -S  Treat string truncations as errors.
  -m  mode
      1 - rows will be loaded in a distributed manner across PMs.
      2 - PM based input files loaded onto their respective PM.
      3 - input files will be loaded on the local PM.
In Mode 1 (the default if no mode is specified), you run cpimport from a central location, which can be the UM or any one of the PMs. The source file is located at this central location, and cpimport distributes the data across all the PM nodes.
Example: cpimport -m1 mytest mytable mytable.tbl
In Mode 2, you also run cpimport from a central location (either the UM or a PM). The source data is already partitioned into data files residing on the PMs. Each PM should have a source data file of the same name, containing that PM's partition of the data.
Example: cpimport -m2 mytest mytable -l /home/mydata/mytable.tbl
In Mode 3, you run cpimport on the individual PM nodes independently; each run imports the source file that exists on that PM. Concurrent imports can be executed on every PM for the same table.
Example: cpimport -m3 mytest mytable /home/mydata/mytable.tbl
Note:
Data can be loaded from STDIN into ColumnStore by simply omitting the loadFile parameter.
Example:
cpimport db1 table1
Standard input can also be used to pipe the output of an arbitrary SELECT statement directly into cpimport. The SELECT statement may select from non-ColumnStore tables such as MyISAM or InnoDB. In the example below, the source table is selected with the -N flag to remove non-data formatting. The -q flag tells the mysql client not to cache results, which avoids possible timeouts that could cause the load to fail.
mcsmysql -q -e 'select * from source_table;' -N <source-db> | /usr/local/mariadb/columnstore/bin/cpimport -s '\t' <target-db> <target-table>
Similarly, the AWS CLI utility can be used to read data from an S3 bucket and pipe the output into cpimport, allowing direct loading from S3. This assumes the aws CLI program has been installed and configured on the host:
aws s3 cp --quiet s3://dthompson-test/trades_bulk.csv - | cpimport test trades -s ","
For troubleshooting connectivity problems, remove the --quiet option, which suppresses client logging, including permission errors.
There are two ways multiple tables can be loaded:
For example, to generate a job file covering the tables in the mytest schema and then run the import:

colxml mytest -j299
cpimport -m1 -j299
Usage: colxml [options] dbName

Options:
  -d  Delimiter (default '|')
  -e  Maximum allowable errors (per table)
  -h  Print this message
  -j  Job id (numeric)
  -l  Load file name
  -n  "name in quotes"
  -p  Path for XML job description file that is generated
  -s  "Description in quotes"
  -t  Table name
  -u  User
  -r  Number of read buffers
  -c  Application read buffer size (in bytes)
  -w  I/O library buffer size (in bytes), used to read files
  -x  Extension of file name (default ".tbl")
  -E  EnclosedByChar (if data has enclosed values)
  -C  EscapeChar
  -b  Debug level (1-3)
The following tables comprise a database named 'tpch2':
MariaDB [tpch2]> show tables;
+-----------------+
| Tables_in_tpch2 |
+-----------------+
| customer        |
| lineitem        |
| nation          |
| orders          |
| part            |
| partsupp        |
| region          |
| supplier        |
+-----------------+
8 rows in set (0.00 sec)
/usr/local/mariadb/columnstore/bin/colxml tpch2 -j500
Running colxml with the following parameters:
2015-10-07 15:14:20 (9481) INFO :
        Schema: tpch2
        Tables:
        Load Files:
        -b 0
        -c 1048576
        -d |
        -e 10
        -j 500
        -n
        -p /usr/local/mariadb/columnstore/data/bulk/job/
        -r 5
        -s
        -u
        -w 10485760
        -x tbl
File completed for tables:
        tpch2.customer
        tpch2.lineitem
        tpch2.nation
        tpch2.orders
        tpch2.part
        tpch2.partsupp
        tpch2.region
        tpch2.supplier
Normal exit.
Now run cpimport, using the job file generated by the colxml execution:
/usr/local/mariadb/columnstore/bin/cpimport -j 500
Bulkload root directory : /usr/local/mariadb/columnstore/data/bulk
job description file : Job_500.xml
2015-10-07 15:14:59 (9952) INFO : successfully load job file /usr/local/mariadb/columnstore/data/bulk/job/Job_500.xml
2015-10-07 15:14:59 (9952) INFO : PreProcessing check starts
2015-10-07 15:15:04 (9952) INFO : PreProcessing check completed
2015-10-07 15:15:04 (9952) INFO : preProcess completed, total run time : 5 seconds
2015-10-07 15:15:04 (9952) INFO : No of Read Threads Spawned = 1
2015-10-07 15:15:04 (9952) INFO : No of Parse Threads Spawned = 3
2015-10-07 15:15:06 (9952) INFO : For table tpch2.customer: 150000 rows processed and 150000 rows inserted.
2015-10-07 15:16:12 (9952) INFO : For table tpch2.nation: 25 rows processed and 25 rows inserted.
2015-10-07 15:16:12 (9952) INFO : For table tpch2.lineitem: 6001215 rows processed and 6001215 rows inserted.
2015-10-07 15:16:31 (9952) INFO : For table tpch2.orders: 1500000 rows processed and 1500000 rows inserted.
2015-10-07 15:16:33 (9952) INFO : For table tpch2.part: 200000 rows processed and 200000 rows inserted.
2015-10-07 15:16:44 (9952) INFO : For table tpch2.partsupp: 800000 rows processed and 800000 rows inserted.
2015-10-07 15:16:44 (9952) INFO : For table tpch2.region: 5 rows processed and 5 rows inserted.
2015-10-07 15:16:45 (9952) INFO : For table tpch2.supplier: 10000 rows processed and 10000 rows inserted.
If there are differences between the input file and the table definition, the colxml utility can be used to handle these cases:
In this case, run the colxml utility (the -t argument can be useful for producing a job file for a single table, if preferred) to produce the job XML file, use this as a template for editing, and then use the edited job file when running cpimport.
Consider the following simple table example:
create table emp (
  emp_id int,
  dept_id int,
  name varchar(30),
  salary int,
  hire_date date
) engine=columnstore;
This would produce a colxml file with the following table element:
<Table tblName="test.emp" loadName="emp.tbl" maxErrRow="10">
  <Column colName="emp_id"/>
  <Column colName="dept_id"/>
  <Column colName="name"/>
  <Column colName="salary"/>
  <Column colName="hire_date"/>
</Table>
If your input file has the data ordered such that hire_date comes before salary, then the following modification allows that data to load correctly into the original table definition (note the last two Column elements are swapped):
<Table tblName="test.emp" loadName="emp.tbl" maxErrRow="10">
  <Column colName="emp_id"/>
  <Column colName="dept_id"/>
  <Column colName="name"/>
  <Column colName="hire_date"/>
  <Column colName="salary"/>
</Table>
The following example would ignore the last entry in the file and default salary to its default value (in this case NULL):
<Table tblName="test.emp" loadName="emp.tbl" maxErrRow="10">
  <Column colName="emp_id"/>
  <Column colName="dept_id"/>
  <Column colName="name"/>
  <Column colName="hire_date"/>
  <IgnoreField/>
  <DefaultColumn colName="salary"/>
</Table>
Both instructions can be used independently and as many times as makes sense for your data and table definition.
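For illustration, a pipe-delimited input row matching the reordered layout above (hire_date before salary, dates in yyyy-mm-dd format) might look like this (the values are illustrative):

1001|7|Jane Doe|2019-03-15|60000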
It is possible to import from a binary file instead of a CSV file, using fixed-length rows of binary data. This is done with the '-I' flag, which has two modes: 1 (import NULL values) and 2 (saturate NULL values).
Example: cpimport -I1 mytest mytable /home/mydata/mytable.bin
The following table shows how to represent the data in the binary format:
| Datatype | Description |
|---|---|
| INT/TINYINT/SMALLINT/BIGINT | Little-endian format for the numeric data |
| FLOAT/DOUBLE | IEEE format native to the computer |
| CHAR/VARCHAR | Data padded with '\0' for the length of the field. An entry that is all '\0' is treated as NULL |
| DATE | Using the Date struct below |
| DATETIME | Using the DateTime struct below |
| DECIMAL | Stored using an integer representation of the DECIMAL without the decimal point. With precision/width of 2 or less 2 bytes should be used, 3-4 should use 3 bytes, 4-9 should use 4 bytes, and 10+ should use 8 bytes |
For NULL values the following table should be used:
| Datatype | Signed NULL | Unsigned NULL |
|---|---|---|
| BIGINT | 0x8000000000000000ULL | 0xFFFFFFFFFFFFFFFEULL |
| INT | 0x80000000 | 0xFFFFFFFE |
| SMALLINT | 0x8000 | 0xFFFE |
| TINYINT | 0x80 | 0xFE |
| DECIMAL | As equiv. INT | As equiv. INT |
| FLOAT | 0xFFAAAAAA | N/A |
| DOUBLE | 0xFFFAAAAAAAAAAAAAULL | N/A |
| DATE | 0xFFFFFFFE | N/A |
| DATETIME | 0xFFFFFFFFFFFFFFFEULL | N/A |
| CHAR/VARCHAR | Fill with '\0' | N/A |
struct Date
{
  unsigned spare  : 6;
  unsigned day    : 6;
  unsigned month  : 4;
  unsigned year   : 16;
};
The spare bits in the Date struct must be set to 0x3E.
struct DateTime
{
  unsigned msecond : 20;
  unsigned second  : 6;
  unsigned minute  : 6;
  unsigned hour    : 6;
  unsigned day     : 6;
  unsigned month   : 4;
  unsigned year    : 16;
};
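To make the layout concrete, the following minimal C sketch (an illustration, not part of the cpimport distribution) writes a single fixed-length binary row for the test.emp table defined earlier, with the salary column set to the signed INT NULL marker. It assumes a little-endian host, so writing an int32_t with fwrite yields the little-endian layout described above, and it packs the DATE manually rather than relying on compiler-specific bitfield ordering. The resulting emp.bin file could then be loaded with a command such as cpimport -I1 test emp emp.bin (with -I2, the NULL salary would be saturated instead).

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Signed INT NULL marker from the NULL table above (0x80000000). */
#define INT_NULL INT32_MIN

/* Pack a DATE as laid out by the Date struct above:
   spare(6, lowest bits) | day(6) | month(4) | year(16).
   The spare bits must be set to 0x3E. */
static uint32_t pack_date(unsigned year, unsigned month, unsigned day)
{
    return 0x3Eu | (day << 6) | (month << 12) | ((uint32_t)year << 16);
}

int main(void)
{
    FILE *f = fopen("emp.bin", "wb");   /* file to load with cpimport -I1 */
    if (!f)
        return 1;

    int32_t  emp_id   = 1001;
    int32_t  dept_id  = 7;
    char     name[30] = {0};            /* varchar(30): '\0'-padded field */
    int32_t  salary   = INT_NULL;       /* imported as NULL with -I1      */
    uint32_t hire     = pack_date(2019, 3, 15);

    strncpy(name, "Jane Doe", sizeof(name));

    /* Field order and widths follow the emp table definition. */
    fwrite(&emp_id,  sizeof(emp_id),  1, f);
    fwrite(&dept_id, sizeof(dept_id), 1, f);
    fwrite(name,     sizeof(name),    1, f);
    fwrite(&salary,  sizeof(salary),  1, f);
    fwrite(&hire,    sizeof(hire),    1, f);

    fclose(f);
    return 0;
}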