ablog (original) (raw)

Redshift ユーザーによって COPY/UNLOAD でアクセスできる S3 のフォルダ(プレフィックス)を分ける方法をメモ。

実現すること

検証結果

Redshift ユーザー "rs_user_a" は S3 パス "s3:///rs_user_a/" 以下のみに COPY/UNLOAD でアクセスできる。

$ psql "host=redshift-cluster-poc-central.********.ap-northeast-1.redshift.amazonaws.com user=rs_user_a dbname=dev port=5439"

dev=> unload ('select * from public.customer') dev-> to 's3://s3-for-redshift/rs_user_a/customer' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a' dev-> header dev-> format csv dev-> gzip dev-> parallel off dev-> allowoverwrite; INFO: UNLOAD completed, 10125500 record(s) unloaded successfully. UNLOAD

dev=> copy public.customer from 's3://s3-for-redshift/rs_user_a/customer000.gz' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a' dev-> format csv dev-> gzip compupdate off region 'ap-northeast-1' dev-> ignoreheader 1; INFO: Load into table 'customer' completed, 10125500 record(s) loaded successfully. COPY

dev=> unload ('select * from public.customer') dev-> to 's3://s3-for-redshift/rs_user_b/customer' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a' dev-> header dev-> format csv dev-> gzip dev-> parallel off dev-> allowoverwrite; ERROR: S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_a/RedshiftIamRoleSession is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_b/customer000.gz" because no identity-based polic DETAIL:

error: S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_a/RedshiftIamRoleSession is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_b/customer000.gz" because no identity-based polic code: 8001 context: Failed to initialize S3 output stream. S3 path: s3://s3-for-redshift/rs_user_b/customer000.gz query: 35041819 location: s3_text_unloader.cpp:369 process: padbmaster [pid=1073750164]

dev=> copy public.customer from 's3://s3-for-redshift/rs_user_b/customer000.gz' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a' dev-> format csv dev-> gzip compupdate off region 'ap-northeast-1' dev-> ignoreheader 1;

ERROR: S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_a/RedshiftIamRoleSession is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_b/customer000.gz" because no identity-based polic DETAIL:

error: S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_a/RedshiftIamRoleSession is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_b/customer000.gz" because no identity-based polic code: 8001 context: S3 key being read : s3://s3-for-redshift/rs_user_b/customer000.gz query: 35482475 location: copy_s3_scanner.cpp:344 process: query1_252_35482475 [pid=15960]

Redshift ユーザー "rs_user_b" は S3 パス "s3:///rs_user_b/" 以下のみに COPY/UNLOAD でアクセスできる。

$ psql "host=redshift-cluster-poc-central.********.ap-northeast-1.redshift.amazonaws.com user=rs_user_b dbname=dev port=5439"

dev=> unload ('select * from public.customer') dev-> to 's3://s3-for-redshift/rs_user_b/customer' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b' dev-> header dev-> format csv dev-> gzip dev-> parallel off dev-> allowoverwrite; INFO: UNLOAD completed, 20251000 record(s) unloaded successfully. UNLOAD

dev=> copy public.customer from 's3://s3-for-redshift/rs_user_b/customer000.gz' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b' dev-> format csv dev-> gzip compupdate off region 'ap-northeast-1' dev-> ignoreheader 1; INFO: Load into table 'customer' completed, 20251000 record(s) loaded successfully. COPY

dev=> unload ('select * from public.customer') dev-> to 's3://s3-for-redshift/rs_user_a/customer' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b' dev-> header dev-> format csv dev-> gzip dev-> parallel off dev-> allowoverwrite;

ERROR: S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_b/RedshiftIamRoleSession is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_a/customer000.gz" because no identity-based polic DETAIL:

error: S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_b/RedshiftIamRoleSession is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_a/customer000.gz" because no identity-based polic code: 8001 context: Failed to initialize S3 output stream. S3 path: s3://s3-for-redshift/rs_user_a/customer000.gz query: 35483157 location: s3_text_unloader.cpp:369 process: padbmaster [pid=1073988131]

dev=> copy public.customer from 's3://s3-for-redshift/rs_user_a/customer000.gz' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b' dev-> format csv dev-> gzip compupdate off region 'ap-northeast-1' dev-> ignoreheader 1;

ERROR: S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_b/RedshiftIamRoleSession is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_a/customer000.gz" because no identity-based polic DETAIL:

error: S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_b/RedshiftIamRoleSession is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_a/customer000.gz" because no identity-based polic code: 8001 context: S3 key being read : s3://s3-for-redshift/rs_user_a/customer000.gz query: 35483161 location: copy_s3_scanner.cpp:344 process: query1_252_35483161 [pid=15960]

GRANT されていない IAM ロールには ASSUMEROLE できない

$ psql "host=redshift-cluster-poc-central.ceyg6jv96hfq.ap-northeast-1.redshift.amazonaws.com user=rs_user_a dbname=dev port=5439" dev=> copy public.customer from 's3://s3-for-redshift/rs_user_b/customer000.gz' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b' dev-> format csv dev-> gzip compupdate off region 'ap-northeast-1' dev-> ignoreheader 1; ERROR: User rs_user_a does not have ASSUMEROLE permission on IAM role "arn:aws:iam::123456789012:role/iam-role_for_rs_role_b" for COPY

$ psql "host=redshift-cluster-poc-central.ceyg6jv96hfq.ap-northeast-1.redshift.amazonaws.com user=rs_user_b dbname=dev port=5439" dev=> copy public.customer from 's3://s3-for-redshift/rs_user_a/customer000.gz' dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a' dev-> format csv dev-> gzip compupdate off region 'ap-northeast-1' dev-> ignoreheader 1; ERROR: User rs_user_b does not have ASSUMEROLE permission on IAM role "arn:aws:iam::123456789012:role/iam-role_for_rs_role_a" for COPY

設定手順

revoke assumerole on all from public for all;

ASSUMEROLE のための IAMロール "iam-role_for_rs_role_a" を作成する。

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::" ] }, { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::s3-for-redshift/rs_user_a/", "arn:aws:s3:::s3-for-redshift/rs_user_a/" ] } ] }

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "redshift.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }

ASSUMEROLE のための IAMロール "iam-role_for_rs_role_b" を作成する。

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::" ] }, { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::s3-for-redshift/rs_user_b/", "arn:aws:s3:::s3-for-redshift/rs_user_b/" ] } ] }

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "redshift.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }

CREATE USER rs_user_a PASSWORD 'Password123!'; CREATE USER rs_user_b PASSWORD 'Password123!'; grant usage on schema public to rs_user_a, rs_user_b; grant all on all tables in schema public to rs_user_a, rs_user_b;

GRANT ASSUMEROLE ON 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a' TO rs_user_a FOR COPY, UNLOAD;

GRANT ASSUMEROLE ON 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b' TO rs_user_b FOR COPY, UNLOAD;

設定したロールを確認する

dev=# select pg_get_iam_role_by_user('rs_user_a'); pg_get_iam_role_by_user

(rs_user_a,all,None) (rs_user_a,arn:aws:iam::123456789012:role/iam-role_for_rs_role_a,COPY|UNLOAD) (2 rows)

dev=# select pg_get_iam_role_by_user('rs_user_b'); pg_get_iam_role_by_user

(rs_user_b,all,None) (rs_user_b,arn:aws:iam::123456789012:role/iam-role_for_rs_role_b,COPY|UNLOAD) (2 rows)

補足

grant assumerole on all to public for all;

参考

ASSUMEROLE アクセス許可の付与
指定されたロールを持つユーザーおよびグループに付与される ASSUMEROLE アクセス許可の構文を次に示します。ASSUMEROLE 権限の使用を開始する際は、「ASSUMEROLE アクセス許可を付与するための使用上の注意事項」を参照してください。

GRANT ASSUMEROLE ON { 'iam_role' [, ...] | default | ALL } TO { username | ROLE role_name | GROUP group_name | PUBLIC } [, ...] FOR { ALL | COPY | UNLOAD | EXTERNAL FUNCTION | CREATE MODEL } [, ...]

GRANT - Amazon Redshift

Redshift で UNLOAD した時のファイル suffix の命名規則

ネーミングルール

検証手順

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' csv parallel off;

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' csv gzip parallel off;

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' csv zstd parallel off;

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' csv bzip2 parallel off;

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' parquet parallel off;

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' json gzip parallel off;

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' json bzip2 parallel off;

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' json zstd parallel off;

unload ('select * from public.customer') to 's3://test-rs-copy-bucket/unload/customer_' allowoverwrite iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess' json parallel off;

COPY コマンドで Redshift にデータロード時に無効な UTF-8 文字がある場合、ACCEPTINVCHARS で固定の文字列に置換することができる。

ACCEPTINVCHARS [AS] ['replacement_char']
データに無効な UTF-8 文字がある場合でも、VARCHAR 列へのデータのロードを有効にします。ACCEPTINVCHARS を指定した場合、COPY は replacement_char で指定されている文字列から構成される同じ長さの文字列で、無効な各 UTF-8 文字を置き換えます。たとえば、置換文字が '^' である場合、無効な 3 バイト文字は '^^^' で置き換えられます。

置換文字には NULL 以外の任意の ASCII 文字を使用できます。デフォルトは疑問符 (?) です。無効な UTF-8 文字の詳細については、「マルチバイト文字のロードエラー」を参照してください。

COPY は無効な UTF-8 文字を含んだ行の数を返し、対象行ごとに STL_REPLACEMENTS システムテーブルにエントリを追加します (各ノードスライスで最大 100 行まで)。さらに多くの無効な UTF-8 文字も置き換えられますが、それらの置換イベントは記録されません。

ACCEPTINVCHARS を指定しなかった場合、無効な UTF-8 文字があるごとに、COPY はエラーを返します。

ACCEPTINVCHARS は VARCHAR 列に対してのみ有効です。

データ変換パラメータ - Amazon Redshift

Redshift で COPY コマンド実行時に svl_query_metrics_summary や sys_query_history に情報が記録されることを確認したメモ。

結果

dev=# copy public.customer from 's3://test-rs-copy-bucket/csv/customer/' csv dev-# iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'; INFO: Load into table 'customer' completed, 9000000 record(s) loaded successfully. COPY Time: 8538.877 ms (00:08.539) dev=# select pg_last_query_id(); pg_last_query_id

     33416134

(1 row)

Time: 3.421 ms dev=# \gset Time: 3.452 ms dev=# \x Expanded display is on. dev=# select * from svl_query_metrics_summary where query = :pg_last_query_id; -[ RECORD 1 ] userid | 100 query | 33416134 service_class | 102 query_cpu_time | 14 query_blocks_read | 160 query_execution_time | 8 query_cpu_usage_percent | 40.07 query_temp_blocks_to_disk | segment_execution_time | 3 cpu_skew | 1.91 io_skew | 1.00 scan_row_count | 8850432 join_row_count | nested_loop_join_row_count | return_row_count | spectrum_scan_row_count | spectrum_scan_size_mb | query_queue_time | service_class_name | Default queue

Time: 682.883 ms dev=# select * from sys_query_history where transaction_id = (select xid from stl_query where query = :pg_last_query_id); -[ RECORD 1 ] user_id | 100 query_id | 33416133 query_label | default transaction_id | 93744987 session_id | 1073889466 database_name | dev query_type | COPY status | success
result_cache_hit | f start_time | 2024-09-06 00:04:43.957806 end_time | 2024-09-06 00:04:51.574099 elapsed_time | 7616293 queue_time | 0 execution_time | 7491272 error_message | returned_rows | 0 returned_bytes | 0 query_text | copy public.customer from 's3://test-rs-copy-bucket/csv/customer/' csv\niam_role ''; redshift_version | 1.0.73348
usage_limit | compute_type | primary compile_time | 6539232 planning_time | 0 lock_wait_time | 36

Time: 790.980 ms

手順

create table public.customer ( c_custkey integer not null, c_name varchar(25) not null, c_address varchar(25) not null, c_city varchar(10) not null, c_nation varchar(15) not null, c_region varchar(12) not null, c_phone varchar(15) not null, c_mktsegment varchar(10) not null ) diststyle even compound sortkey(c_nation,c_region);

copy public.customer from 's3://test-rs-copy-bucket/csv/customer/' csv iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess';

select pg_last_query_id(); \gset

\x select * from svl_query_metrics_summary where query = :pg_last_query_id; select * from sys_query_history where transaction_id = (select xid from stl_query where query = :pg_last_query_id);

Redshift の STL テーブルのクエリID(query列)と SYS ビューのクエリID(query_id)は別の値のため、クエリIDで結合することはできない。トランザクションID(STLとSYSのトランザクションIDが 1:1 の場合)で結合することができる。

dev=# \timing on dev=# select count(distinct(lo_orderkey)) from lineorder; count

150000000 (1 row)

Time: 13282.862 ms (00:13.283) dev=# select pg_last_query_id(); pg_last_query_id

     33269125

(1 row)

Time: 3.632 ms dev=# \gset Time: 3.703 ms dev=# \x Expanded display is on. dev=# select * from sys_query_history where transaction_id = (select xid from stl_query where query = :pg_last_query_id); -[ RECORD 1 ] user_id | 100 query_id | 33269123 query_label | default transaction_id | 93596628 session_id | 1073922183 database_name | dev query_type | SELECT status | success result_cache_hit | f start_time | 2024-09-05 04:25:48.409507 end_time | 2024-09-05 04:26:01.689166 elapsed_time | 13279659 queue_time | 0 execution_time | 13263519 error_message | returned_rows | 1 returned_bytes | 15 query_text | select count(distinct(lo_orderkey)) from lineorder; redshift_version | 1.0.73348 usage_limit | compute_type | primary compile_time | 155 planning_time | 6009 lock_wait_time | 22

Time: 378.407 ms

参考

SYS monitoring views such as such as SYS_QUERY_HISTORY and SYS_QUERY_DETAIL contain the query_id column, which holds the identifier for users’ queries. Similarly, provisioned-only views such as STL_QUERY and SVL_QLOG contain the query column, which also holds the query identifiers. However, the query identifiers recorded in the SYS system views are different from those recorded in the provisioned-only views.

The difference between the SYS views’ query_id column values and the provisioned-only views’ query column values is as follows:

In SYS views, the query_id column records user-submitted queries in their original form. The Amazon Redshift optimizer might break them down into child queries for improved performance, but a single query you run will still only have a single row in SYS_QUERY_HISTORY. If you want to see the individual child queries, you can find them in SYS_QUERY_DETAIL.

In provisioned-only views, the query column records queries at the child query level. If the Amazon Redshift optimizer rewrites your original query into multiple child queries, there will be multiple rows in STL_QUERY with differing query identifier values for a single query you run.

When you migrate your monitoring and diagnostic queries from provisioned-only views to SYS views, consider this difference and edit your queries accordingly. For more information on how Amazon Redshift processes queries, see Query planning and execution workflow.

System tables and views reference - Amazon Redshift

STL STV SVL SVV SYS
タイプ テーブル テーブル ビュー ビュー ビュー
生成方法 ディスク上のログ オンメモリーデータ STLへの参照 STVへの参照 -
用途 過去の実行記録の参照 現在進行中の処理の参照 STL/STVデータを組み合わせて別軸で分析 同左 -
記録タイミング 実行直後* - 実行中 - -
保持期間 7日間 - 7日間 - 7日間

参考

STL システムビューは 7 日間のログ履歴を保持します。ログの保持は、すべてのクラスターサイズとノードタイプで保証されており、クラスターワークロードの変化による影響を受けません。また、ログの保持は、クラスターの一時停止などのクラスターの状態からも影響を受けません。クラスターが新しい場合のみ、ログ履歴が 7 日未満になります。ログを保持するために必要なアクションはありませんが、7 日以上前のログデータを保持するには、ログを定期的に他のテーブルにコピーするか、Amazon S3 にアンロードする必要があります。

ログ記録のための STL ビュー - Amazon Redshift

dev=# select nspname from pg_namespace where nspowner = 1; nspname

pg_toast pg_internal pg_automv pg_temp_1 pg_catalog information_schema catalog_history public pg_temp_7 pg_temp_8 pg_temp_9 pg_temp_5 pg_temp_6 pg_temp_11 pg_auto_copy pg_s3 pg_mv (17 rows)

dev=# select distinct(split_part(tablename,'_',1)) from pg_tables where schemaname = 'pg_catalog'; split_part

padb pg stcs stll stv systable (6 rows)

dev=# select distinct(split_part(viewname,'_',1)) from pg_views where schemaname = 'pg_catalog'; split_part

pg stl svcs svl svv sys (6 rows)

$ psql "host=redshift-cluster-poc-central.********.ap-northeast-1.redshift.amazonaws.com user=awsuser dbname=dev port=5439" psql (13.7, server 8.0.2) SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off) Type "help" for help.

dev=# \pset pager Pager usage is off. dev=# select version(); version

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.73348 (1 row)

dev=# set enable_result_cache_for_session=off; SET dev=# \timing on Timing is on. dev=# select count(*) from lineorder; count

1200075804 (1 row)

Time: 31.825 ms dev=# select pg_last_query_id(); pg_last_query_id

     33268129 

(1 row)

Time: 3.663 ms ★クエリの実行時間は約3.6ミリ秒 dev=# \gset Time: 3.583 ms dev=# select userid,query,query_execution_time,query_blocks_read from svl_query_metrics_summary where query = :pg_last_query_id; userid | query | query_execution_time | query_blocks_read --------+-------+----------------------+------------------- (0 rows) ★記録されていない

Time: 243.333 ms dev=# select count(distinct(lo_orderkey)) from lineorder; count

150000000 (1 row)

Time: 11748.946 ms (00:11.749) dev=# select pg_last_query_id(); pg_last_query_id

     33268148

(1 row)

Time: 3.521 ms ★クエリの実行時間は3.5秒 dev=# \gset Time: 3.580 ms dev=# select userid,query,query_execution_time,query_blocks_read from svl_query_metrics_summary where query = :pg_last_query_id; userid | query | query_execution_time | query_blocks_read --------+----------+----------------------+------------------- 100 | 33268148 | 12 | 4269 (1 row) ★記録されている

Time: 230.942 ms

今更だけど、社内外でよく2023年のAWSサミット東京で発表した "Amazon Redshift クエリパフォーマンスチューニング Deep Dive" のスライドのパスをよく聞かれるのでメモしておく。

発表後に SYS_QUERY_HISTORY など便利なビューも増えている。