1. A company wants to improve the data load time of a sales data dashboard. Data has been collected as .csv files and stored within an Amazon S3 bucket that is partitioned by date. The data is then loaded to an Amazon Redshift data warehouse for frequent analysis. The data volume is up to 500 GB per day. Which solution will improve the data loading performance?
A) Compress .csv files and use an INSERT statement to ingest data into Amazon Redshift.
B) Split large .csv files, then use a COPY command to load data into Amazon Redshift.
C) Use Amazon Kinesis Data Firehose to ingest data into Amazon Redshift.
D) Load the .csv files in an unsorted key order and vacuum the table in Amazon Redshift.
2. A company has a data warehouse in Amazon Redshift that is approximately 500 TB in size. New data is imported every few hours and read-only queries are run throughout the day and evening. There is a particularly heavy load with no writes for several hours each morning on business days. During those hours, some queries are queued and take a long time to execute. The company needs to optimize query execution and avoid any downtime. What is the MOST cost-effective solution?
A) Enable concurrency scaling in the workload management (WLM) queue.
B) Add more nodes using the AWS Management Console during peak hours. Set the distribution style to ALL.
C) Use elastic resize to quickly add nodes during peak times. Remove the nodes when they are not needed.
D) Use a snapshot, restore, and resize operation. Switch to the new target cluster.
3. A company stores its sales and marketing data, which includes personally identifiable information (PII), in Amazon S3. The company allows its analysts to launch their own Amazon EMR clusters and run analytics reports with the data. To meet compliance requirements, the company must ensure the data is not publicly accessible throughout this process. A data engineer has secured Amazon S3 but must ensure the individual EMR clusters created by the analysts are not exposed to the public internet. Which solution should the data engineer use to meet this compliance requirement with the LEAST amount of effort?
A) Create an EMR security configuration and ensure the security configuration is associated with the EMR clusters when they are created.
B) Check the security group of the EMR clusters regularly to ensure it does not allow inbound traffic from IPv4 0.0.0.0/0 or IPv6 ::/0.
C) Enable the block public access setting for Amazon EMR at the account level before any EMR cluster is created.
D) Use AWS WAF to block public internet access to the EMR clusters across the board.
4. A marketing company is using Amazon EMR clusters for its workloads. The company manually installs third-party libraries on the clusters by logging in to the master nodes. A data analyst needs to create an automated solution to replace the manual process. Which options can fulfill these requirements? (Choose two.)
A) Place the required installation scripts in Amazon S3 and execute them using custom bootstrap actions.
B) Place the required installation scripts in Amazon S3 and execute them through Apache Spark in Amazon EMR.
C) Install the required third-party libraries in the existing EMR master node. Create an AMI out of that master node and use that custom AMI to re-create the EMR cluster.
D) Use an Amazon DynamoDB table to store the list of required applications. Trigger an AWS Lambda function with DynamoDB Streams to install the software.
E) Launch an Amazon EC2 instance with Amazon Linux and install the required third-party libraries on the instance. Create an AMI and use that AMI to create the EMR cluster.
5. A company wants to research user turnover by analyzing the past 3 months of user activities. With millions of users, 1.5 TB of uncompressed data is generated each day. A 30-node Amazon Redshift cluster with 2.56 TB of solid state drive (SSD) storage for each node is required to meet the query performance goals. The company wants to run an additional analysis on a year's worth of historical data to examine trends indicating which features are most popular. This analysis will be done once a week. What is the MOST cost-effective solution?
A) Increase the size of the Amazon Redshift cluster to 120 nodes so it has enough storage capacity to hold 1 year of data. Then use Amazon Redshift for the additional analysis.
B) Keep the data from the last 90 days in Amazon Redshift. Move data older than 90 days to Amazon S3 and store it in Apache Parquet format partitioned by date. Then use Amazon Redshift Spectrum for the additional analysis.
C) Keep the data from the last 90 days in Amazon Redshift. Move data older than 90 days to Amazon S3 and store it in Apache Parquet format partitioned by date. Then provision a persistent Amazon EMR cluster and use Apache Presto for the additional analysis.
D) Resize the cluster node type to the dense storage node type (DS2) for an additional 16 TB storage capacity on each individual node in the Amazon Redshift cluster. Then use Amazon Redshift for the additional analysis.
1. Right Answer: B
Explanation: The COPY command loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. When you load all the data from a single large file, Amazon Redshift is forced to perform a serialized load, which is much slower. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression. The number of files should be a multiple of the number of slices in your cluster.
Reference: https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
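To make the mechanics concrete, here is a minimal sketch that issues such a COPY through the Amazon Redshift Data API using boto3. The bucket, prefix, IAM role, cluster, database, and table names are all hypothetical placeholders, not values from the question.

```python
import boto3

# Redshift Data API client; avoids managing JDBC/ODBC connections.
client = boto3.client("redshift-data")

# COPY reads every object under the prefix in parallel, one file per slice,
# so splitting the daily load into roughly equal gzip parts (1 MB-1 GB each,
# ideally a multiple of the cluster's slice count) maximizes parallelism.
copy_sql = """
COPY sales.daily_facts
FROM 's3://example-sales-bucket/dt=2024-01-15/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
CSV GZIP;
"""

client.execute_statement(
    ClusterIdentifier="example-cluster",  # hypothetical cluster name
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
```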
2. Right Answer: A
Explanation: Enabling concurrency scaling on a workload management (WLM) queue lets Amazon Redshift automatically add transient cluster capacity when read queries begin to queue during the morning peak, then release it when it is no longer needed. There is no downtime, and the transient capacity is billed only while it is in use, which makes it more cost-effective than resizing the cluster or keeping extra nodes running.
References:
https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html
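A minimal sketch of what enabling concurrency scaling could look like with boto3, by setting the concurrency scaling mode to auto on a WLM queue in the wlm_json_configuration parameter. The parameter group name is a hypothetical placeholder, and the exact queue layout would depend on the cluster's existing WLM setup.

```python
import json

import boto3

redshift = boto3.client("redshift")

# One manual WLM queue with concurrency scaling enabled ("auto"), plus the
# short query acceleration queue. Queued read queries overflow to transient
# scaling clusters instead of waiting.
wlm_config = [
    {"query_concurrency": 5, "concurrency_scaling": "auto"},
    {"short_query_queue": True},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="example-wlm-params",  # hypothetical parameter group
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)
```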
3. Right Answer: C
Explanation: The block public access setting for Amazon EMR is an account-level, per-Region guardrail: once enabled, EMR refuses to launch any cluster whose security groups allow inbound traffic from public IP addresses (0.0.0.0/0 or ::/0) on ports that are not explicitly permitted. Enabling it once, before the analysts create their clusters, covers every cluster they launch and requires far less ongoing effort than auditing security groups or per-cluster configurations.
Reference: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-block-public-access.html
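A minimal sketch of turning the setting on with boto3. It applies account-wide per Region; the port 22 exception shown mirrors the service default and is an assumption about what the company would still permit for SSH.

```python
import boto3

emr = boto3.client("emr")

# After this call, EMR refuses to create any cluster whose security groups
# allow inbound traffic from 0.0.0.0/0 or ::/0, except on the permitted
# port ranges listed below.
emr.put_block_public_access_configuration(
    BlockPublicAccessConfiguration={
        "BlockPublicSecurityGroupRules": True,
        "PermittedPublicSecurityGroupRuleRanges": [
            {"MinRange": 22, "MaxRange": 22},  # SSH exception (service default)
        ],
    }
)
```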
4. Right Answer: A, E
Explanation: Custom bootstrap actions run an S3-hosted script on every node before Amazon EMR starts its applications, which automates the library installation. Alternatively, the required libraries can be pre-installed on an Amazon Linux EC2 instance, captured as a custom AMI, and used to launch the EMR cluster.
References:
https://aws.amazon.com/about-aws/whats-new/2017/07/amazon-emr-now-supports-launching-clusters-with-custom-amazon-linux-amis/
https://docs.aws.amazon.com/de_de/emr/latest/ManagementGuide/emr-plan-bootstrap.html
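For illustration, a minimal boto3 sketch that combines both answers in one run_job_flow call: a custom bootstrap action pointing at an S3-hosted install script (option A) and a custom Amazon Linux AMI with the libraries pre-baked (option E). Every name, path, ARN, and ID here is a hypothetical placeholder.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.10.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    # Option A: the script runs on every node before applications start.
    BootstrapActions=[
        {
            "Name": "install-third-party-libs",
            "ScriptBootstrapAction": {
                "Path": "s3://example-bucket/bootstrap/install_libs.sh"
            },
        }
    ],
    # Option E: an Amazon Linux AMI with the libraries already installed.
    CustomAmiId="ami-0123456789abcdef0",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```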
5. Right Answer: B
Explanation: Amazon Redshift Spectrum executes queries across thousands of parallelized nodes to deliver fast results, regardless of the complexity of the query or the amount of data. Keeping only the most recent 90 days in the cluster and querying the older data in place on Amazon S3, stored as date-partitioned Apache Parquet, avoids paying for additional Redshift nodes or a persistent Amazon EMR cluster just to run a weekly analysis.
Reference: https://aws.amazon.com/redshift/features/
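A minimal sketch of exposing the archived Parquet data to Redshift Spectrum through the Data API with boto3, assuming the external schema is backed by an AWS Glue Data Catalog database. Every identifier is a hypothetical placeholder, and new date partitions would still need to be registered (for example with ALTER TABLE ... ADD PARTITION).

```python
import boto3

client = boto3.client("redshift-data")

# External schema backed by a Glue Data Catalog database, then an external
# table over the date-partitioned Parquet files in S3.
statements = [
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'sales_archive'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
    """
    CREATE EXTERNAL TABLE spectrum.user_activity (
        user_id BIGINT,
        feature VARCHAR(64),
        event_ts TIMESTAMP
    )
    PARTITIONED BY (dt DATE)
    STORED AS PARQUET
    LOCATION 's3://example-archive-bucket/user-activity/';
    """,
]

for sql in statements:
    client.execute_statement(
        ClusterIdentifier="example-cluster",  # hypothetical cluster
        Database="analytics",
        DbUser="analyst",
        Sql=sql,
    )
```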