In this post, I introduce the merge function in BigQuery.
The use cases where we utilize it are mainly to eliminate data duplication. It is a powerful tool that allows us to manage and maintain data integrity. By merging data efficiently, we can ensure our datasets are free from redundancies.
Merge Query
Basic Query
MERGE dataset.master M
USING dataset.temp T
ON
M.id = T.id
WHEN MATCHED THEN
UPDATE SET
M.hoge = T.hoge,
M.fuga = T.fuga
WHEN NOT MATCHED THEN
INSERT ROW
With partition tables
if the table we merge is partitioned, we can’t use 「INSERT ROW」statement. If you use it, you can see the following error message.
「if the target is ingestion-time partitioned table, this won’t work you have to specify the columns:」
MERGE dataset.master M
USING dataset.temp T
ON
M.id = T.id
WHEN MATCHED THEN
UPDATE SET
M.hoge = T.hoge,
M.fuga = T.fuga
WHEN NOT MATCHED THEN
INSERT(
hoge,
fuga
)
VALUES(
hoge,
fuga
)
With One to many tables
If the relationship between the master and temp tables is one-to-many, we can use the composite key.
MERGE dataset.master M
USING dataset.temp T
ON
M.id = T.id
AND
M.status = T.status
WHEN MATCHED THEN
UPDATE SET
M.hoge = T.hoge,
M.fuga = T.fuga
WHEN NOT MATCHED THEN
INSERT ROW
How to improve the MERGE efficiency
We have three options to improve the MEREGE efficiency.
subquery filter
MERGE dataset.master M
USING (SELECT * FROM dataset.temp WHERE _PARTITIONTIME = '2018-01-01') T
ON M.id = T.id
WHEN MATCHED THEN
DELETE
search_condition
MERGE dataset.master M
USING dataset.temp T
ON M.id = T.id
WHEN MATCHED AND M._PARTITIONTIME = '2018-01-01' THEN
UPDATE SET id = S.id
WHEN MATCHED AND M._PARTITIONTIME = '2018-01-02' THEN
UPDATE SET price = S.price + 10
merge_condition
MERGE dataset.master M
USING dataset.temp T
ON M.id = T.id AND
M._PARTITIONTIME = '2018-01-01' AND T._PARTITIONTIME = '2018-01-01'
WHEN MATCHED THEN
UPDATE SET M.status= T.status