1. A company wants to enrich application logs in near-real time and use the enriched dataset for further analysis. The application runs on Amazon EC2 instances across multiple Availability Zones and stores its logs in Amazon CloudWatch Logs. The enrichment source is stored in an Amazon DynamoDB table. Which solution meets the requirements for event collection and enrichment?
A) Use a CloudWatch Logs subscription to send the data to Amazon Kinesis Data Firehose. Use AWS Lambda to transform the data in the Kinesis Data Firehose delivery stream and enrich it with the data in the DynamoDB table. Configure Amazon S3 as the Kinesis Data Firehose delivery destination.
B) Export the raw logs to Amazon S3 on an hourly basis using the AWS CLI. Use AWS Glue crawlers to catalog the logs. Set up an AWS Glue connection for the DynamoDB table and set up an AWS Glue ETL job to enrich the data. Store the enriched data in Amazon S3.
C) Configure the application to write the logs locally and use Amazon Kinesis Agent to send the data to Amazon Kinesis Data Streams. Configure a Kinesis Data Analytics SQL application with the Kinesis data stream as the source. Join the SQL application input stream with DynamoDB records, and then store the enriched output stream in Amazon S3 using Amazon Kinesis Data Firehose.
D) Export the raw logs to Amazon S3 on an hourly basis using the AWS CLI. Use Apache Spark SQL on Amazon EMR to read the logs from Amazon S3 and enrich the records with the data from DynamoDB. Store the enriched data in Amazon S3.
2. A banking company wants to collect large volumes of transactional data using Amazon Kinesis Data Streams for real-time analytics. The company uses PutRecord to send data to Amazon Kinesis, and has observed network outages during certain times of the day. The company wants to obtain exactly-once semantics for the entire processing pipeline. What should the company do to obtain these characteristics?
A) Design the application so it can remove duplicates during processing by embedding a unique ID in each record.
B) Rely on the processing semantics of Amazon Kinesis Data Analytics to avoid duplicate processing of events.
C) Design the data producer so events are not ingested into Kinesis Data Streams multiple times.
D) Rely on the exactly-once processing semantics of Apache Flink and Apache Spark Streaming included in Amazon EMR.
3. A data analyst is using AWS Glue to organize, cleanse, validate, and format a 200 GB dataset. The data analyst triggered the job to run with the Standard worker type. After 3 hours, the AWS Glue job status is still RUNNING, and logs from the job run show no error codes. The data analyst wants to improve the job execution time without overprovisioning. Which actions should the data analyst take?
A) Enable job bookmarks in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the executor-cores job parameter.
B) Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the maximum capacity job parameter.
C) Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the spark.yarn.executor.memoryOverhead job parameter.
D) Enable job bookmarks in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the num-executors job parameter.
4. A company has a business unit uploading .csv files to an Amazon S3 bucket. The company's data platform team has set up an AWS Glue crawler to do discovery and create tables and schemas. An AWS Glue job writes processed data from the created tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creates the Amazon Redshift table appropriately. When the AWS Glue job is rerun for any reason in a day, duplicate records are introduced into the Amazon Redshift table. Which solution will update the Redshift table without duplicates when jobs are rerun?
A) Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.
B) Load the previously inserted data into a MySQL database in the AWS Glue job. Perform an upsert operation in MySQL, and copy the results to the Amazon Redshift table.
C) Use Apache Spark's DataFrame dropDuplicates() API to eliminate duplicates and then write the data to Amazon Redshift.
D) Use the AWS Glue ResolveChoice built-in transform to select the most recent value of the column.
5. A streaming application is reading data from Amazon Kinesis Data Streams and immediately writing the data to an Amazon S3 bucket every 10 seconds. The application is reading data from hundreds of shards. The batch interval cannot be changed due to a separate requirement. The data is being accessed by Amazon Athena. Users are seeing degradation in query performance as time progresses. Which action can help improve query performance?
A) Merge the files in Amazon S3 to form larger files.
B) Increase the number of shards in Kinesis Data Streams.
C) Add more memory and CPU capacity to the streaming application.
D) Write the files to multiple S3 buckets.
1. Right Answer: A. Explanation: The application already stores its logs in CloudWatch Logs, so the simplest near-real-time path is a subscription filter that sends the logs to Kinesis Data Firehose, which invokes a Lambda function to enrich each record with the DynamoDB data before delivering the results to Amazon S3 for further analysis. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#FirehoseExample
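As a rough illustration of option A, the following is a minimal sketch of a Firehose transformation Lambda that decompresses the CloudWatch Logs payload, looks up enrichment data in DynamoDB, and returns the enriched records to the delivery stream. The table name enrichment_table and the app_id lookup key are assumptions for illustration, not part of the question.

```python
# Hedged sketch of a Kinesis Data Firehose transformation Lambda.
# "enrichment_table" and the "app_id" key are assumed names.
import base64
import gzip
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("enrichment_table")  # assumed enrichment table


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # CloudWatch Logs subscription data arrives gzip-compressed and base64-encoded.
        payload = gzip.decompress(base64.b64decode(record["data"]))
        log_batch = json.loads(payload)

        enriched_events = []
        for log_event in log_batch.get("logEvents", []):
            # Assumed lookup: enrich each log event with an item keyed by the log group name.
            item = table.get_item(Key={"app_id": log_batch["logGroup"]}).get("Item", {})
            enriched_events.append({**log_event, "enrichment": item})

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(enriched_events) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```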
2. Right Answer: A. Explanation: Producer retries (for example, after a network outage or timeout) can write the same record to the stream more than once, so applications that need strict guarantees should embed a primary key within each record and remove duplicates later when processing. Reference: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html
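A minimal sketch of this pattern is shown below: the producer embeds a unique ID in each record before calling PutRecord, and the consumer drops any ID it has already seen. Stream, field, and partition-key names are assumptions, and the in-memory set stands in for a durable deduplication store.

```python
# Hedged sketch: embed a unique ID per record so duplicates from retries
# can be removed during processing. Names are assumed for illustration.
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")


def put_transaction(stream_name, transaction):
    record = {"record_id": str(uuid.uuid4()), **transaction}  # embedded primary key
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(transaction["account_id"]),  # assumed partition key field
    )


seen_ids = set()  # in practice, a durable store (e.g., DynamoDB) would track processed IDs


def process(record):
    data = json.loads(record["Data"])
    if data["record_id"] in seen_ids:  # duplicate produced by a retry
        return
    seen_ids.add(data["record_id"])
    # ... downstream processing of the unique record ...
```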
3. Right Answer: B. Explanation: Enabling job metrics in AWS Glue lets the analyst profile how many executors the job actually needs and estimate the required number of DPUs; the maximum capacity job parameter can then be increased to match, improving execution time without overprovisioning. Reference: https://docs.aws.amazon.com/glue/latest/dg/monitor-debug-capacity.html
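For illustration, here is a hedged boto3 sketch that enables job metrics and raises the job's maximum capacity after profiling. The job name cleanse-dataset-job and the value 20.0 are assumptions; the right capacity should come from the profiled metrics.

```python
# Hedged sketch: enable Glue job metrics and raise MaxCapacity (DPUs).
# Job name and capacity value are assumed for illustration.
import boto3

glue = boto3.client("glue")

job = glue.get_job(JobName="cleanse-dataset-job")["Job"]  # assumed job name
default_args = job.get("DefaultArguments", {})
default_args["--enable-metrics"] = ""  # turn on CloudWatch job metrics for profiling

glue.update_job(
    JobName="cleanse-dataset-job",
    JobUpdate={
        # UpdateJob overwrites the job definition, so carry over the essentials.
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": default_args,
        "MaxCapacity": 20.0,  # increase DPUs based on the profiled metrics
    },
)
```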
4. Right Answer: A. Explanation: Loading the rows into a staging table and merging them into the main table through postactions SQL in the DynamicFrameWriter replaces existing rows instead of appending them, so reruns do not introduce duplicates. Reference: https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
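The following is a minimal sketch of the staging-table pattern in an AWS Glue (PySpark) job. The catalog database, table, connection, column, and S3 path names are assumptions for illustration only.

```python
# Hedged sketch of loading into a Redshift staging table and merging via postactions.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Assumed source: the table created by the Glue crawler from the .csv files.
processed_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="csv_db", table_name="sales_csv"
)

# Delete-and-insert merge keyed on an assumed "id" column, run after the staging load.
post_actions = """
    BEGIN;
    DELETE FROM public.sales USING public.sales_staging
        WHERE public.sales.id = public.sales_staging.id;
    INSERT INTO public.sales SELECT * FROM public.sales_staging;
    DROP TABLE public.sales_staging;
    END;
"""

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=processed_dyf,
    catalog_connection="redshift-connection",  # assumed Glue connection name
    connection_options={
        "dbtable": "public.sales_staging",     # load into the staging table first
        "database": "dev",
        "postactions": post_actions,           # then merge into the main table
    },
    redshift_tmp_dir="s3://my-temp-bucket/redshift/",  # assumed temp location
)
```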
5. Right Answer: A. Explanation: Merging the many small objects in Amazon S3 into larger files reduces the overhead Athena incurs listing and opening each object, which improves read and query performance. Reference: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
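One hedged way to compact the small files is an Athena CTAS query that rewrites the data as larger Parquet objects, as sketched below. The database, table, and bucket names are assumptions for illustration.

```python
# Hedged sketch: compact many small S3 objects into larger Parquet files via Athena CTAS.
# "logs_db", "events_raw", and the bucket paths are assumed names.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE logs_db.events_compacted
WITH (
    format = 'PARQUET',
    external_location = 's3://my-analytics-bucket/events-compacted/'
) AS
SELECT * FROM logs_db.events_raw;
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
```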