AWS Certified Data Analytics - Specialty - Part 6

Mary Smith

Thu, 15 May 2025

1. An online retail company with millions of users around the globe wants to improve its ecommerce analytics capabilities. Currently, clickstream data is uploaded directly to Amazon S3 as compressed files. Several times each day, an application running on Amazon EC2 processes the data and makes search options and reports available for visualization by editors and marketers. The company wants to make website clicks and aggregated data available to editors and marketers in minutes to enable them to connect with users more effectively. Which options will help meet these requirements in the MOST efficient way? (Choose two.)

A) Use Amazon Kinesis Data Firehose to upload compressed and batched clickstream records to Amazon Elasticsearch Service.
B) Upload clickstream records to Amazon S3 as compressed files. Then use AWS Lambda to send data to Amazon Elasticsearch Service from Amazon S3.
C) Use Amazon Elasticsearch Service deployed on Amazon EC2 to aggregate, filter, and process the data. Refresh content performance dashboards in near-real time.
D) Use Kibana to aggregate, filter, and visualize the data stored in Amazon Elasticsearch Service. Refresh content performance dashboards in near-real time.
E) Upload clickstream records from Amazon S3 to Amazon Kinesis Data Streams and use a Kinesis Data Streams consumer to send records to Amazon Elasticsearch Service.


2. A company is streaming its high-volume billing data (100 MBps) to Amazon Kinesis Data Streams. A data analyst partitioned the data on account_id to ensure that all records belonging to an account go to the same Kinesis shard and order is maintained. While building a custom consumer using the Kinesis Java SDK, the data analyst notices that, sometimes, the messages arrive out of order for account_id. Upon further investigation, the data analyst discovers the messages that are out of order seem to be arriving from different shards for the same account_id and are seen when a stream resize runs. What is an explanation for this behavior, and what is the solution?

A) There are multiple shards in a stream and order needs to be maintained in the shard. The data analyst needs to make sure there is only a single shard in the stream and no stream resize runs.
B) The hash key generation process for the records is not working correctly. The data analyst should generate an explicit hash key on the producer side so the records are directed to the appropriate shard accurately.
C) The records are not being received by Kinesis Data Streams in order. The producer should use the PutRecords API call instead of the PutRecord API call with the SequenceNumberForOrdering parameter.
D) The consumer is not processing the parent shard completely before processing the child shards after a stream resize. The data analyst should process the parent shard completely first before processing the child shards.



3. A media analytics company consumes a stream of social media posts. The posts are sent to an Amazon Kinesis data stream partitioned on user_id. An AWS Lambda function retrieves the records and validates the content before loading the posts into an Amazon Elasticsearch cluster. The validation process needs to receive the posts for a given user in the order they were received. A data analyst has noticed that, during peak hours, the social media platform posts take more than an hour to appear in the Elasticsearch cluster. What should the data analyst do to reduce this latency?

A) Migrate the validation process to Amazon Kinesis Data Firehose.
B) Migrate the Lambda consumers from standard data stream iterators to an HTTP/2 stream consumer.
C) Increase the number of shards in the stream.
D) Configure multiple Lambda functions to process the stream.



4. A company launched a service that produces millions of messages every day and uses Amazon Kinesis Data Streams as the streaming service. The company uses the Kinesis SDK to write data to Kinesis Data Streams. A few months after launch, a data analyst found that write performance is significantly reduced. The data analyst investigated the metrics and determined that Kinesis is throttling the write requests. The data analyst wants to address this issue without significant changes to the architecture. Which actions should the data analyst take to resolve this issue? (Choose two.)

A) Increase the Kinesis Data Streams retention period to reduce throttling.
B) Replace the Kinesis API-based data ingestion mechanism with Kinesis Agent.
C) Increase the number of shards in the stream using the UpdateShardCount API.
D) Choose partition keys in a way that results in a uniform record distribution across shards.
E) Customize the application code to include retry logic to improve performance.


5. A smart home automation company must efficiently ingest and process messages from various connected devices and sensors. The majority of these messages arrive as a large number of small files. These messages are ingested using Amazon Kinesis Data Streams and sent to Amazon S3 using a Kinesis data stream consumer application. The Amazon S3 message data is then passed through a processing pipeline built on Amazon EMR running scheduled PySpark jobs. The data platform team manages data processing and is concerned about the efficiency and cost of downstream data processing. They want to continue to use PySpark. Which solution improves the efficiency of the data processing jobs and is well architected?

A) Send the sensor and devices data directly to a Kinesis Data Firehose delivery stream to send the data to Amazon S3 with Apache Parquet record format conversion enabled. Use Amazon EMR running PySpark to process the data in Amazon S3.
B) Set up an AWS Lambda function with a Python runtime environment. Process individual Kinesis data stream messages from the connected devices and sensors using Lambda.
C) Launch an Amazon Redshift cluster. Copy the collected data from Amazon S3 to Amazon Redshift and move the data processing jobs from Amazon EMR to Amazon Redshift.
D) Set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format. Migrate the downstream PySpark jobs from Amazon EMR to AWS Glue.



1. Right Answer: A,D
Explanation: Kinesis Data Firehose can ingest the compressed, batched clickstream records and deliver them to Amazon Elasticsearch Service within minutes (A). Kibana then aggregates, filters, and visualizes the data stored in Elasticsearch, so the content performance dashboards can be refreshed in near-real time (D).
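
The winning combination can be provisioned with a single Firehose API call. Below is a minimal boto3 sketch, assuming placeholder names and ARNs plus an existing IAM role and Elasticsearch domain; it creates a delivery stream that buffers records for about a minute before indexing them, so Kibana dashboards stay only minutes behind the live clickstream.

```python
import boto3

firehose = boto3.client("firehose")

# Sketch only: all names and ARNs below are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-es",
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-es-delivery",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/clickstream",
        "IndexName": "clicks",
        "IndexRotationPeriod": "OneDay",
        # Flush whichever comes first: 5 MB of data or 60 seconds.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        # Keep a copy of any records Elasticsearch rejects.
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-es-delivery",
            "BucketARN": "arn:aws:s3:::clickstream-firehose-backup",
        },
    },
)
```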

2. Right Answer: D
Explanation: The parent shards that remain after the reshard can still contain data, added to the stream before the reshard, that you have not yet read. If you read data from the child shards before having read all data from the parent shards, you could read data for a particular hash key out of the order given by the data records' sequence numbers. Therefore, assuming that the order of the data is important, after a reshard you should always continue to read data from the parent shards until they are exhausted, and only then begin reading data from the child shards. Reference: https://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-after-resharding.html
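
The sketch below (boto3, with a placeholder stream name, a simplified view of shard lineage, and a dummy record handler) shows the idea: closed parent shards are drained to their final record, signalled by a null NextShardIterator, before any child shard created by the resize is read.

```python
import boto3

kinesis = boto3.client("kinesis")
STREAM = "billing-stream"  # placeholder stream name

shards = kinesis.list_shards(StreamName=STREAM)["Shards"]
shard_ids = {s["ShardId"] for s in shards}


def process(record):
    # Application-specific handling (placeholder).
    print(record["PartitionKey"], record["SequenceNumber"])


def drain(shard_id):
    """Read one shard until it is fully consumed.

    A closed (parent) shard returns a null NextShardIterator once its last
    record has been read; an open shard is read up to the current tip.
    """
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        for record in resp["Records"]:
            process(record)
        if not resp["Records"] and resp.get("MillisBehindLatest", 0) == 0:
            break  # caught up on an open shard
        iterator = resp.get("NextShardIterator")  # None once a closed shard is exhausted


# Read parent shards (those whose own parent is no longer listed) first,
# then the child shards produced by the resharding.
for shard in shards:
    if shard.get("ParentShardId") not in shard_ids:
        drain(shard["ShardId"])
for shard in shards:
    if shard.get("ParentShardId") in shard_ids:
        drain(shard["ShardId"])
```

The Kinesis Client Library (KCL) enforces this parent-before-child ordering automatically, which is one reason a KCL-based consumer is usually preferable to one built directly on the SDK.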

3. Right Answer: C
Explanation: Because the data is partitioned on user_id, we can assume it is distributed more or less evenly across the shards, and since no other applications are mentioned, the Lambda function is the only consumer of the stream. During peak load, the Lambda consumer falls about an hour behind. Even switching to an enhanced fan-out (HTTP/2) consumer might not add enough read throughput to clear a one-hour backlog. Increasing the number of shards raises the stream's read and write throughput, allows more parallel processing, and reduces the latency.
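
Resharding is a single control-plane call. A minimal boto3 sketch follows, assuming a stream named social-posts and a simple "double it" policy; the actual target count should be sized to the observed consumer lag.

```python
import boto3

kinesis = boto3.client("kinesis")

# Sketch only: the stream name and scaling factor are assumptions.
summary = kinesis.describe_stream_summary(StreamName="social-posts")
open_shards = summary["StreamDescriptionSummary"]["OpenShardCount"]

# Doubling the shard count doubles read and write throughput;
# UNIFORM_SCALING splits the existing shards evenly.
kinesis.update_shard_count(
    StreamName="social-posts",
    TargetShardCount=open_shards * 2,
    ScalingType="UNIFORM_SCALING",
)
```

As noted in question 2, the consumer must still finish the closed parent shards before reading the new child shards so that per-user ordering is preserved.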

4. Right Answer: C,D
Explanation: Write performance degrading a few months after launch indicates that load has grown steadily rather than spiking temporarily, so the stream either has too few shards or a partition key that overloads ("hot-shards") a subset of them. Increasing the number of shards with the UpdateShardCount API (C) and choosing partition keys that distribute records uniformly across shards (D) remove the throttling without significant changes to the architecture.
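
Option C is the same UpdateShardCount call sketched under question 3. To illustrate option D, here is a hypothetical producer-side snippet (stream name and payload are made up) that uses a high-cardinality partition key so writes spread evenly across shards instead of concentrating on a few hot ones.

```python
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")


def put_message(stream_name, payload):
    """Write one record with a high-cardinality partition key (sketch)."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(payload).encode("utf-8"),
        # A per-record UUID hashes uniformly across the key space,
        # so no single shard absorbs a disproportionate share of writes.
        PartitionKey=str(uuid.uuid4()),
    )


put_message("service-messages", {"event": "example", "value": 42})
```

A random key is acceptable here only because this scenario has no per-key ordering requirement; where ordering per entity matters (as in questions 2 and 3), the key must stay stable per entity and uniformity has to come from the key's natural cardinality.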

5. Right Answer: A
Explanation: Option A improves ingestion and storage efficiency by leveraging Kinesis Data Firehose to buffer the small messages into larger objects and convert them to Apache Parquet before they land in Amazon S3. This clearly "improves the efficiency of the data processing jobs and is well architected" while letting the team "continue to use PySpark" on Amazon EMR.
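
A sketch of that delivery stream with record format conversion turned on is shown below (the bucket, IAM role, and AWS Glue database/table names are placeholders; the Glue table supplies the schema Firehose needs to write Parquet). Buffering up to 128 MB per object also addresses the small-files problem that was hurting the downstream PySpark jobs.

```python
import boto3

firehose = boto3.client("firehose")

# Sketch only: ARNs, bucket, and Glue catalog names are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="sensor-data-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-delivery",
        "BucketARN": "arn:aws:s3:::sensor-data-lake",
        # Larger buffers mean fewer, bigger Parquet objects for PySpark to read.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-delivery",
                "DatabaseName": "sensors",
                "TableName": "device_messages",
                "Region": "us-east-1",
            },
        },
    },
)
```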
