S3 Data - how do I read it using Arcadia? Do I need s3a?

Once you have data in S3, the Arcadia Enterprise platform can connect to it and read from it. But you need to consider a few things:

1. S3 vs s3a vs s3n - which protocol to use?

Data is data - you don't have to change anything about your data in S3.

What matters is the protocol used to read it. Arcadia supports s3a, the replacement for (and successor to) S3 Native (s3n): any object accessible from an s3n:// URL should also be accessible from s3a simply by replacing the URL scheme.
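The scheme swap really is just textual - the bucket and object key do not change. A minimal sketch (a hypothetical helper, not part of Arcadia or Hadoop):

```python
def to_s3a(url: str) -> str:
    """Rewrite an s3n:// (or s3://) URL to use the s3a scheme.

    Only the scheme changes; the bucket and key are untouched.
    """
    scheme, sep, rest = url.partition("://")
    if scheme in ("s3n", "s3"):
        return "s3a://" + rest
    return url  # already s3a (or some other scheme) - leave as-is

print(to_s3a("s3n://my-bucket/raw_data/part-00000.parquet"))
# -> s3a://my-bucket/raw_data/part-00000.parquet
```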

2. External table definition
In your external table definition you can force the use of the s3a protocol by explicitly using "s3a" in the LOCATION URL, as shown in this example:

CREATE EXTERNAL TABLE table_s3 (
  `src_file_nme` string,
  `create_dts` timestamp)
PARTITIONED BY (
  `file_date` string)
STORED AS PARQUET
LOCATION
  's3a://bucket_name/raw_data/';
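Because the table above is partitioned, an S3 directory only becomes visible to queries after it is registered as a partition. A hedged example (the bucket, path, and date value are placeholders, and REFRESH assumes an Impala-compatible engine such as Arcadia's):

```sql
-- Register an existing S3 directory as a partition of the table:
ALTER TABLE table_s3 ADD PARTITION (file_date='2018-01-01')
LOCATION 's3a://bucket_name/raw_data/file_date=2018-01-01/';

-- Refresh metadata so the new partition is picked up:
REFRESH table_s3;
```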

3. Create database with S3 specified
To specify that any tables created within a database reside on Amazon S3, you can include an s3a:// prefix on the LOCATION attribute. This is a convenient way to force s3a across all table definitions within the database.
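For example (the database name and bucket are placeholders):

```sql
-- Tables created in this database without an explicit LOCATION
-- will be stored under the database's s3a:// location.
CREATE DATABASE s3_db
LOCATION 's3a://bucket_name/s3_db/';
```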

4. S3 ID & Secret Keys
Reading from S3 buckets requires access keys to be set up. You can do so by adding the entries below to core-site.xml on your Hadoop cluster.

  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>

If you are using a Hadoop distribution managed with Cloudera Manager, the settings would look like this: