Google Professional Data Engineer Quiz 1 Topic 3 Questions 1-5

Question: 1

Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.

The data scientists have written the following code to read the data for new key features in the logs.

BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")

You want to improve the performance of this data read. What should you do?

A. Specify the TableReference object in the code.

B. Use the .fromQuery operation to read specific fields from the table.

C. Use both the Google BigQuery TableSchema and TableFieldSchema classes.

D. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.


Question: 2

You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor= ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

[Image: q2_Professional Data Engineer, showing answer options A to D]

A. Option A

B. Option B

C. Option C

D. Option D
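
The "combinatorial explosion" the question refers to can be illustrated by counting index entries. When a single Datastore index includes more than one multi-valued property, the entity needs one index entry per combination of values. The following is a minimal sketch in plain Python, not the Datastore API: the `movie` values and the `index_entries` helper are hypothetical and exist only to show the arithmetic.

```python
from itertools import product

# Hypothetical Movie entity; the property values are made up for illustration.
movie = {
    "actors": ["Actor 1", "Actor 2", "Actor 3"],
    "tags": ["Comedy", "Family"],
    "date_released": "2015-06-01",
}


def index_entries(entity, properties):
    """Return one index entry per combination of values of the listed properties."""
    value_lists = []
    for name in properties:
        value = entity[name]
        # Single-valued properties contribute exactly one value.
        value_lists.append(value if isinstance(value, list) else [value])
    return list(product(*value_lists))


# One index that covers both multi-valued properties at once.
combined = index_entries(movie, ["actors", "tags", "date_released"])

# Two indexes, each pairing one multi-valued property with the sort key.
by_actor = index_entries(movie, ["actors", "date_released"])
by_tag = index_entries(movie, ["tags", "date_released"])

print(len(combined))                # 3 actors * 2 tags = 6 entries
print(len(by_actor) + len(by_tag))  # 3 + 2 = 5 entries
```

With only three actors and two tags the difference is small, but the combined count grows as a product of the list sizes while the separate counts grow as a sum.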


Question: 3

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?

A. Threading

B. Serialization

C. Dropout Methods

D. Dimensionality Reduction
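
For reference, the dropout technique named in option C can be sketched in a few lines. This is a plain-Python illustration of "inverted" dropout under stated assumptions, not TensorFlow code (TensorFlow exposes the technique as `tf.keras.layers.Dropout`). The `dropout` function, its parameters, and the sample activations are hypothetical, and the sketch assumes `0 <= rate < 1`.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`
    and scale the survivors by 1 / (1 - rate). Assumes 0 <= rate < 1."""
    if not training or rate == 0.0:
        # At evaluation time the layer passes values through unchanged.
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so repeated runs drop the same units
layer_output = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]

train_out = dropout(layer_output, rate=0.5, rng=rng)
eval_out = dropout(layer_output, rate=0.5, rng=rng, training=False)

print(train_out)
print(eval_out)
```

Because different units are silenced on each training step, the network cannot rely on any single neuron, which is why the technique is commonly used against overfitting.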


Question: 4

You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.)

A. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.

B. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.

C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.

D. Leverage the Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate the external Hive tables to native ones.

E. Load the ORC files into BigQuery. Leverage the BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate the external Hive tables to native ones.


Question: 5

You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of their expected ranges, which indicates errors. You need to address this issue and be able to run the process again in the future, while keeping the original data for compliance reasons. What should you do?

A. Import the data from Cloud Storage into BigQuery. Create a new BigQuery table, and skip the rows with errors.

B. Create a Compute Engine instance and create a new copy of the data in Cloud Storage. Skip the rows with errors.

C. Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.

D. Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to the same dataset in Cloud Storage.
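
The range check described in options C and D can be sketched as a small, repeatable function. This is plain Python standing in for a pipeline step, not the Cloud Dataflow API. The `clean_records` helper, the `latency_ms` field, the range limits, and the default value are all hypothetical.

```python
def clean_records(records, field, low, high, default):
    """Return new records in which out-of-range values of `field` are
    replaced by `default`. The input records are not modified."""
    cleaned = []
    for rec in records:
        if low <= rec[field] <= high:
            cleaned.append(dict(rec))  # copy, so the output is independent
        else:
            cleaned.append({**rec, field: default})
    return cleaned


# Hypothetical log records; two of them fall outside the expected range.
raw = [
    {"id": 1, "latency_ms": 120},
    {"id": 2, "latency_ms": -5},
    {"id": 3, "latency_ms": 990000},
]

fixed = clean_records(raw, "latency_ms", low=0, high=60000, default=0)

print(fixed)
print(raw)  # the original records are unchanged
```

Because the function returns new records rather than editing its input, it can be rerun on the same source data at any time.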

