Export data YCQL
This page documents the preview version (v2.23). Preview includes features under active development and is for development and testing only. For production, use the stable version (v2024.1). To learn more, see Versioning.
Bulk export is available for YugabyteDB's Cassandra-compatible YCQL API. To export data from a YugabyteDB or an Apache Cassandra table, you can use the cassandra-unloader
tool.
A typical workflow is to start with creating a source YugabyteDB table and populate it with data, then exporting the data using the cassandra-unloader
tool.
Create source table
The following example represents the schema of the destination YugabyteDB table:
CREATE KEYSPACE load;
USE load;
CREATE TABLE users(
user_id varchar,
score1 double,
score2 double,
points int,
object_id varchar,
PRIMARY KEY (user_id));
Generate sample data
You can generate data by executing a Python script, as per the following example:
# sample usage:
# To generate a 10GB (10240 MB) file.
# % python gen_csv.py <outfile_name> <outfile_size_MB>
# % python gen_csv.py file01.csv 10240
#
import numpy as np
import uuid
import csv
import os
import sys
outfile = sys.argv[1] # output file name
outsize_mb = int(sys.argv[2])
print("Outfile = " + outfile)
print("Outfile Size (MB) = " + str(outsize_mb))
chunksize = 10000
with open(outfile, 'ab') as csvfile:
while (os.path.getsize(outfile)//1024**2) < outsize_mb:
data = [[uuid.uuid4() for i in range(chunksize)],
np.random.random(chunksize)*1000,
np.random.random(chunksize)*50,
np.random.randint(1000000, size=(chunksize,)),
[uuid.uuid4() for i in range(chunksize)]]
csvfile.writelines(['%s,%.6f,%.6f,%i,%s\n' % row for row in zip(*data)])
The following shows sample rows generated by the script:
head file00.csv
3399bebc-d2cc-40c6-89d4-26102e08ff61,622.491927,40.262305,658257,44d73f8c-1d3c-424e-8fd2-d316c56b8454
4f362eac-f79f-45f6-b6b1-bd5a81f931dc,141.344278,3.024717,694290,7768b010-8411-490a-b523-88cc3ec53cb5
a24a6587-eea4-4907-ac7f-9f99dcac8f82,345.110599,3.869150,510943,5765d1d3-2855-4dbe-9f11-bb3b8631789f
...
To generate five CSV files of approximately 5 GB each, run the following commands:
python ./gen_csv.py file00.csv 5120 &
python ./gen_csv.py file01.csv 5120 &
python ./gen_csv.py file02.csv 5120 &
python ./gen_csv.py file03.csv 5120 &
python ./gen_csv.py file04.csv 5120 &
Load sample data
cassandra-loader
is a general-purpose bulk loader for CQL that supports various types of delimited files, particularly CSV files. For details, review the README file of the YugabyteDB cassandra-loader fork. Note that cassandra-loader
requires quotes for collection types (for example, "[1,2,3]" rather than [1,2,3] for lists).
You can install cassandra-loader
as follows:
wget https://github.com/yugabyte/cassandra-loader/releases/download/<latest-version>/cassandra-loader
chmod a+x cassandra-loader
You can run cassandra-loader
and queue up the files for upload one at a time, as follows:
./cassandra-loader \
-schema "load.users(user_id, score1, score2, points, object_id)" \
-boolStyle 1_0 \
-numFutures 1000 \
-rate 10000 \
-queryTimeout 65 \
-numRetries 10 \
-progressRate 200000 \
-host <clusterNodeIP> \
-f file01.csv
For additional options, refer to cassandra-loader options.
Export data
You can install cassandra-unloader
as follows:
wget https://github.com/brianmhess/cassandra-loader/releases/download/<latest-version>/cassandra-unloader
chmod a+x cassandra-unloader
You can run cassandra-unloader
as follows:
./cassandra-unloader \
-schema "load.users(user_id, score1, score2, points, object_id)" \
-boolStyle 1_0 \
-host <clusterNodeIP> \
-f outfile.csv
For additional options, refer to cassandra-unloader options.
Best practices
Be sure to always specify the time zone, as it is not added to the default timestamp formats when using the cassandra-loader
and cassandra-unloader
utilities. Timestamps must be exported and imported in the same format, including the time zone. For example, yyyy-MM-dd HH:mm:ss.SSSZ
and yyyy-MM-dd HH:mm:ss.SSSXXX
.
It is recommended to use tab character as delimiter on JSONB columns, as the default delimiter, comma ( ,
) does not work with these type of columns. For example, -delim $'\t'
.