Tech Truffle - Data and Cloud Solutions

I passed the AWS Certified Big Data Specialty certification on 20th January, 2020. I already hold the AWS Solutions Architect Associate certification, and comparing the two exams, I would rate the specialty far higher in difficulty and complexity. It took me more than two months to prepare for the exam (with the last three weeks of full-time study). Preparation time can vary based on your understanding of Big Data tools.

In this post, I will share the details of the study material I found helpful for the preparation and some tips along the way. Let us get into the detailed sections one by one.

Section 1: AWS Recommended Material

The best way to start your preparation is to find out what AWS recommends you study and what it expects you to know before you sit the exam. I highly recommend going through all four points below. I personally completed the third and fourth items after I finished the self-paced courses discussed in Section 2.

The official AWS Exam Guide.
The official AWS Sample Exam Questions. They will give you a fair idea of what kind of questions to expect. You will find the answers at the end of the document.
The official Data Analytics Fundamentals Training Course. It is a free course and you can access it with your AWS Training and Certification account.
The official Exam Readiness Course. It is also a free course and you can access it with your AWS Training and Certification account.

Section 2: Self-Paced Courses

I started with A Cloud Guru’s course on Big Data Specialty. This course covers almost 80% of the exam content. I believe they are working on updating it. I really liked the detailed explanations in the Redshift and DynamoDB sections.
I also took a second course, AWS Certified Data Analytics Specialty, on Udemy. This is the most up-to-date course for the exam topics. You also get a full practice test at the end.
During my preparation, I also found the AWS Certified Big Data Specialty course on Linux Academy, although I did not fully complete it because I was planning to sit the exam very soon. The website offers a 7-day trial period, which you can use to finish the course (at least solve the full practice test within the course; highly recommended).

I would strongly recommend doing at least two of these courses, as the teaching style is different in each and you will learn new things about the same topic. After all, the purpose of doing a certification is to improve your knowledge. If you are short on time and want to choose just one, go with the second one.

Very Important -> Do create notes on each topic while you study. They come in really handy for revision later, as there is quite a lot to remember.

Section 3: Whitepapers and re:Invent Videos

During the initial phase of my preparation, I took this section a little less seriously, but as I researched how others had prepared for this exam, I found that everyone emphasized reading whitepapers and watching some of the re:Invent videos. I highly recommend going through the content listed below at least once!

Section 4: Hands-on Experience

In my experience, this is one of the most important things to do before you sit the exam. Create an AWS free tier account and practice the labs, no matter which course from Section 2 you are following.

The exam covers a wide range of topics and services, and each of them has many small but important features. If you skip the labs, there is a high chance that you will either mix up those key features or forget some of them entirely. Therefore, make sure you do the labs in the course. Remember to shut down the services once you have finished the tasks. Otherwise, you may incur heavy costs, e.g. if you leave an EMR cluster running for hours. Doing all the labs will usually cost around five to ten dollars.

Section 5: Time Management and Answering Approach

The exam gives you 170 minutes to answer 65 questions. Most of the questions are scenario-based and span three to six lines. Sometimes the answers are also longer than a line. That works out to about 2 minutes and 37 seconds per question, which is not a lot of time to read a long question and find the right answer. Some tips:

Read the questions fast and note the key points, e.g. real-time, near real-time, cost-optimized.
While reading the answers, try to discard the most obviously wrong ones, e.g. if a question demands a real-time solution, you can discard the options suggesting a data pipeline with batch processing. In most cases, you can easily discard two options. The job is then to find the best of the remaining two. Read the question again, and this time try to find keywords or hints that eliminate the third wrong option.
Do not forget to read the last line(s) of the question again. Sometimes a question describes a scenario at length, with a combination of AWS services in use, but at the end asks for a cost-optimized solution. In that case, you should choose the option that solves the problem at the lowest cost.
If you are stuck on a question, do not spend more time on it. If you have already spent two minutes on a question and still have no clue about the answer, mark it for later review using the flag button. Use the time to complete other questions first. Remember, time management is extremely important.
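
To make the time budget concrete, here is a small Python sketch of the arithmetic above. The 170 minutes and 65 questions come from the exam figures quoted in this section; the two-minute cap per question is the flagging rule from the last tip, and the function names are just illustrative.

```python
def seconds_per_question(total_minutes: int, questions: int) -> float:
    """Average time available per question, in seconds."""
    return total_minutes * 60 / questions


def format_mm_ss(seconds: float) -> str:
    """Round to the nearest second and format as 'Xm YYs'."""
    whole = round(seconds)
    return f"{whole // 60}m {whole % 60:02d}s"


def review_buffer_minutes(total_minutes: int, questions: int, cap_seconds: int) -> float:
    """Minutes left for flagged questions if no question takes longer than cap_seconds."""
    return total_minutes - questions * cap_seconds / 60


print(format_mm_ss(seconds_per_question(170, 65)))  # 2m 37s
print(review_buffer_minutes(170, 65, 120))          # 40.0
```

In other words, if you really do move on after two minutes, you keep roughly 40 minutes at the end for the flagged questions.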

Section 6: Mock Tests and Practice Exams

Do solve practice exams to understand what type of questions can come up in the exam and how far along you are with your preparation. They will also give you an idea of the average time you take to understand a question and answer it. I solved the practice exams available in the Udemy and Linux Academy courses (discussed in Section 2) and a free sample test on Whizlabs, which is also good for practice (but has very lengthy questions).

Section 7: Cheat Sheet on Most Important Services

Amazon S3

Storage classes: S3 Standard, S3 Standard-Infrequent Access (IA), S3 One Zone-Infrequent Access, Glacier (Vault Access policy and Vault lock policy)
Access Control Lists (ACL): allow read/write access to both the objects in the bucket and the permissions to the object
Lifecycle management: Moving data through the storage tiers e.g. ‘S3 standard to IA’ and ‘IA to Glacier’ for cost savings
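
As an illustration of the lifecycle bullet, here is a minimal Python sketch that builds a lifecycle configuration moving objects from S3 Standard to Standard-IA and then to Glacier. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`; the prefix, rule ID and day counts are hypothetical values chosen for the example, and the sketch only builds the document rather than calling AWS.

```python
import json

# S3 only allows a transition to Standard-IA once an object is at least 30 days old.
MIN_DAYS_BEFORE_IA = 30


def build_lifecycle_configuration(prefix: str, days_to_ia: int = 30, days_to_glacier: int = 90) -> dict:
    """Build a Standard -> Standard-IA -> Glacier lifecycle rule for one key prefix."""
    if days_to_ia < MIN_DAYS_BEFORE_IA:
        raise ValueError("objects must be at least 30 days old before moving to Standard-IA")
    if days_to_glacier <= days_to_ia:
        raise ValueError("the Glacier transition must come after the Standard-IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-down-" + (prefix.strip("/") or "all"),  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = build_lifecycle_configuration("logs/")
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, the result could be passed as `LifecycleConfiguration=config` together with a `Bucket` name of your own.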

Amazon Kinesis

Kinesis Data Streams, Kinesis Firehose (delivery streams), Kinesis Analytics, Kinesis Video Streams (especially the differences between Kinesis Data Streams and Kinesis Firehose, their read/write limits and use-cases)
Real-time vs Near Real-Time vs Batch
Kinesis Producer Library (KPL), Kinesis SDK, Kinesis Agent, Kinesis Client Library (KCL)
Scenarios to improve performance (resharding: shard split and merge)
Kinesis Enhanced Fan-out
KCL checkpointing (DynamoDB table and its throughput limits; increase them for performance)
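
The read/write limits become easier to remember once you size a stream with them. The sketch below assumes the standard per-shard limits of Kinesis Data Streams (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads shared by all consumers, or 2 MB/s per consumer with enhanced fan-out). The function name and the workload numbers are made up for illustration.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # ingest limit per shard, MB/s
WRITE_RECORDS_PER_SHARD = 1000  # ingest limit per shard, records/s
READ_MB_PER_SHARD = 2.0         # read limit per shard, MB/s


def shards_needed(write_mb_per_sec: float, records_per_sec: int,
                  consumers: int, enhanced_fan_out: bool = False) -> int:
    """Estimate the shard count that satisfies the write and read limits."""
    by_write_size = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    # Without enhanced fan-out every consumer shares the same 2 MB/s per shard,
    # so the total read volume is the ingest volume times the number of consumers.
    read_mb = write_mb_per_sec * (1 if enhanced_fan_out else consumers)
    by_read = math.ceil(read_mb / READ_MB_PER_SHARD)
    return max(1, by_write_size, by_write_count, by_read)


print(shards_needed(3.0, 2500, consumers=3))                         # 5
print(shards_needed(3.0, 2500, consumers=3, enhanced_fan_out=True))  # 3
```

The example shows why enhanced fan-out matters: with three consumers the shared read limit is the bottleneck, while with enhanced fan-out the stream only needs enough shards for the writes.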

Amazon EMR

Good understanding of HUE, Spark, Spark Streaming, Flume, Zeppelin, Kafka, Pig, Mahout, Phoenix, Spark MLlib (which additional use-cases it supports that the Amazon Machine Learning service does not)
Detailed understanding of Hive and Presto (their use-cases and differences)
Hive metastore and options to create a metastore in RDS or use AWS Glue instead
Master, Core and Task Nodes (detailed differences, e.g. Task nodes are used to save costs)
Transient EMR Clusters to save costs
EMRFS vs HDFS
EMRFS consistent view
EMR best practices
S3DistCp command to copy huge data files

Amazon DynamoDB

Use cases (millisecond latency at scale, gaming, bidding apps, voting apps, real-time)
Capacity requirements (performance based, storage based)
Write Capacity Units / Read Capacity Units (the formulas)
Partitioning (how and when partitions are created)
Understand that partitioning by itself does not increase performance; having a good partition key does. A good partition key distributes data evenly across all partitions
Partition Key and Sort Key (how to choose them, composite key concept)
Hot Partition Key (how to solve: Burst Capacity, or Write Sharding to distribute rows across partitions by using a Random Suffix or a Calculated Suffix)
Secondary Indexes (Local and Global, their differences, when to use what)
DynamoDB Accelerator (DAX)
DynamoDB Streams and its use-cases (replication)
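
For the capacity-unit formulas and the calculated-suffix idea, a short Python sketch may help. It assumes the standard DynamoDB definitions: one RCU is one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one WCU is one write per second of an item up to 1 KB. The `sharded_key` helper, its `.N` suffix format and the shard count of 10 are illustrative choices, not a fixed convention.

```python
import hashlib
import math


def read_capacity_units(item_size_kb: float, reads_per_sec: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: item size is rounded up to the next 4 KB block."""
    total = math.ceil(item_size_kb / 4) * reads_per_sec
    # Eventually consistent reads cost half as much; round up to whole units.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_kb: float, writes_per_sec: int) -> int:
    """WCUs needed: item size is rounded up to the next 1 KB block."""
    return math.ceil(item_size_kb) * writes_per_sec


def sharded_key(partition_key: str, item_id: str, shards: int = 10) -> str:
    """Calculated suffix: derived from the item id, so a reader can recompute it."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:4], "big") % shards
    return f"{partition_key}.{suffix}"


print(read_capacity_units(6, 10))                             # 20
print(read_capacity_units(6, 10, strongly_consistent=False))  # 10
print(write_capacity_units(2.5, 10))                          # 30
print(sharded_key("2020-01-20", "order-12345"))
```

A random suffix spreads writes just as well, but you then have to query every suffix to read an item back. The calculated suffix keeps single-item reads cheap because the same item id always maps to the same shard.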

Amazon Redshift and Redshift Spectrum

You need to know how it works in detail, e.g. cluster architecture, slices
Distribution Styles (Key, Even and All; you need to know inside out when to use which and what the benefits of each are)
Sort Keys: types and their benefits (also study the concept of Zone-Maps)
Data Compression (high level differences between Gzip, Bzip2, Snappy)
Loading data into Redshift, i.e. the famous COPY command, which is everywhere; splitting data into multiple files
Loading encrypted data, i.e. what is supported and what is not
UNLOAD Command and its usage
Workload Management: creating queues to divide and isolate workloads
Views: create views to restrict access
Operations: Backups, snapshots, restore (full cluster, one table)
Vacuum Command and its types
Deep Copy (how it differs from the Vacuum Command)
Manifest file and its usage
Redshift Spectrum: Exabyte scale query engine that can query data in S3, read its use-cases
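
To tie together the COPY, data-splitting and manifest bullets, here is a hedged Python sketch. It relies on two things I am confident of: AWS recommends splitting load data into a number of files that is a multiple of the number of slices in the cluster, and a COPY manifest is a JSON document with an `entries` list of `url`/`mandatory` pairs. The bucket name, file names and slice count are hypothetical.

```python
import json
from typing import List


def recommended_file_count(total_slices: int, min_files: int) -> int:
    """Smallest multiple of total_slices that is at least min_files."""
    multiples = -(-min_files // total_slices)  # ceiling division
    return max(1, multiples) * total_slices


def build_manifest(urls: List[str]) -> str:
    """Build a COPY manifest; 'mandatory' makes COPY fail if a file is missing."""
    entries = [{"url": url, "mandatory": True} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical cluster: 4 nodes with 2 slices each, and at least 20 files wanted.
file_count = recommended_file_count(total_slices=8, min_files=20)
print(file_count)  # 24

urls = [f"s3://my-bucket/sales/part-{i:04d}.gz" for i in range(file_count)]
print(build_manifest(urls[:2]))
```

The manifest would then be uploaded to S3 and referenced from a statement along the lines of `COPY sales FROM 's3://my-bucket/sales.manifest' IAM_ROLE '<role-arn>' GZIP MANIFEST;`, so that exactly the listed files are loaded, in parallel across the slices.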

Amazon Elasticsearch Service

Use cases, e.g. search, log analytics, real-time application monitoring, schema-free JSON documents
Logstash integration
Kibana integration (Cognito for Kibana login)
Domain, i.e. what it is and what it does
Indices and Shards
Zone Awareness and Replica shards

AWS Glue

ETL tool, serverless discovery of table and schema
Crawlers to infer schema on the go
Data Catalog (can replace Hive Metastore)

Data Pipeline

Understand where it can be used and why it is used
High-level knowledge of its components, i.e. Data nodes, Activities, Preconditions, Schedules

SQS

Use-cases: Notifications, batch related pipelines, decoupling services
Each message is processed by a single consumer (no fan-out of the same message to multiple consumer applications)
Data cannot be consumed more than once (deleted after it is consumed, no replay option)

Data Migration Service

Use-cases: Homogeneous and heterogeneous migrations (same source/destination engine, different source/destination engine)
Source remains operational during migration (minimized downtime)
One-time Migration / Continuous replication

Amazon Athena

Interactive SQL query service for S3 (no data load required, can directly read from S3)
CSV, JSON, ORC, Parquet, Avro
Supports structured, semi-structured and unstructured data
Use-cases: ad-hoc queries
Petabyte-level scale
Saves costs if the data it queries is stored in a columnar format such as ORC or Parquet.
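
The cost point is easier to see with numbers. Athena bills by the amount of data scanned, and a columnar format lets a query read only the columns it needs. The sketch below is a rough estimate under loud assumptions: a price of 5 USD per TB scanned and a 10 MB minimum per query (check the current pricing page for your region), a binary terabyte, equally sized columns, and no compression effects.

```python
TB = 1024 ** 4
PRICE_PER_TB_USD = 5.0                # assumption: verify against current Athena pricing
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2  # assumption: 10 MB minimum billed per query


def query_cost_usd(bytes_scanned: float) -> float:
    """Estimated cost of one query from the bytes it scans."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / TB * PRICE_PER_TB_USD


def columnar_bytes_scanned(total_bytes: float, columns_total: int, columns_read: int) -> float:
    """Idealized scan size when only some equally sized columns are read."""
    return total_bytes * columns_read / columns_total


row_format_cost = query_cost_usd(1 * TB)  # CSV/JSON: the whole 1 TB is scanned
columnar_cost = query_cost_usd(columnar_bytes_scanned(1 * TB, 30, 3))
print(f"{row_format_cost:.2f} USD vs {columnar_cost:.2f} USD")
```

Real savings are usually larger than this idealized ratio, since ORC and Parquet files are also compressed, but the proportions are enough to answer the typical "cost-optimized Athena" question.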

Amazon QuickSight

Use cases, i.e. BI tool used for Visualization
What is SPICE
Understand measures and dimensions
You need to know all the visualization types QuickSight offers. These are simple questions, and you can score 100% in this section if you know when to use which graph or chart, e.g. Bar, Line, Pivot table, Heat Maps, KPIs, Pie-charts, Tree-Maps
AutoGraph
Analysis, stories and dashboards
D3.js, Chart.js and Highcharts (not related to QuickSight but related to the overall Visualization section)
Jupyter Notebooks (also not related to QuickSight)

Amazon Machine Learning Service - AML (although it is deprecated, it is still valid for the exam)

Supervised Learning vs Unsupervised (AML covers only supervised)
Algorithms it covers, i.e. binary and multiclass classification models, regression
Use cases for Binary, Multi and Regression models

AWS IoT Service

Authentication and Authorization (X509 certificates, Cognito for mobile users)
Control Plane, Data Plane
Rule Engine, Device Gateway, Device Registry, Device Shadow

Security (Remember it’s job zero as recommended by AWS)

This is one of the most important sections. Do not take it lightly. About 20% of the exam questions come from this section.

Encryption At-rest and In-transit (this is applicable for all services listed in this section)
KMS Service: how does it work, understand the encryption process
S3: Supported Encryption types (server-side encryption, client-side encryption, their differences and use-cases), S3 VPC Endpoints, CloudHSM
EMR: possibilities for at-rest encryption (EBS encryption, open source HDFS encryption, LUKS, EMRFS on S3), possibilities for in-transit (EMRFS is auto enabled, TLS, node to node encryption), Apache Ranger and its role-based access controls
Kinesis (encrypting a stream end to end, i.e. encrypting all data before it enters the producer)
Securing IoT, DynamoDB, Redshift, Lambda and all others.
STS Service for temporary access without IAM accounts
Federation and policies (e.g. use Microsoft Active Directory accounts to log in)
Cross Account Access
VPC Endpoints for secure access between private network and AWS services

Section 8: Helpful Posts and FAQs

Do your research in finding more tips on social media and blogs. I found the below posts helpful for my preparation:

In addition, if you have time, go through the FAQs of each individual service. During preparation, you might come across many random questions, and the FAQ pages can be a handy resource.

Conclusion

It is indeed a difficult exam; I say this not to scare anyone but to prepare and motivate you. If you are well prepared, you will definitely ace it. Hands-on experience is very important. Do not skip the labs, especially for Redshift, DynamoDB, Kinesis and EMR. If you are already familiar with Big Data technologies, it will be easier for you to understand the landscape of Big Data services offered by AWS and how they connect to build complex data lakes, data pipelines, data warehouses and data analytics solutions.

I hope it helps you in your preparation. If you have any questions, reach out. I wish you all the best for your exam.

Happy Learning!

How to pass the AWS Data Analytics Specialty Certification