
Professional-Data-Engineer Exam Questions - Online Test



Printable Professional-Data-Engineer PDF exam materials and exam pricing for the Google certification for IT specialists. Real success guaranteed with updated Professional-Data-Engineer PDF and VCE dump materials. 100% pass the Google Professional Data Engineer exam today!

We also have these free Professional-Data-Engineer dumps questions for you:

NEW QUESTION 1

You have a data pipeline with a Cloud Dataflow job that aggregates and writes time series metrics to Cloud Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. Which two actions should you take? (Choose two.)

  • A. Configure your Cloud Dataflow pipeline to use local execution
  • B. Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions
  • C. Increase the number of nodes in the Cloud Bigtable cluster
  • D. Modify your Cloud Dataflow pipeline to use the Flatten transform before writing to Cloud Bigtable
  • E. Modify your Cloud Dataflow pipeline to use the CoGroupByKey transform before writing to Cloud Bigtable

Answer: BC

NEW QUESTION 2

Suppose you have a dataset of images that are each labeled as to whether or not they contain a human face. To create a neural network that recognizes human faces in images using this labeled dataset, what approach would likely be the most effective?

  • A. Use K-means Clustering to detect faces in the pixels.
  • B. Use feature engineering to add features for eyes, noses, and mouths to the input data.
  • C. Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.
  • D. Build a neural network with an input layer of pixels, a hidden layer, and an output layer with two categories.

Answer: C

Explanation:
Traditional machine learning relies on shallow nets, composed of one input and one output layer, and at most one hidden layer in between. More than three layers (including input and output) qualifies as “deep” learning. So deep is a strictly defined, technical term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer’s output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.
A neural network with only one hidden layer would be unable to automatically recognize high-level features of faces, such as eyes, because it wouldn't be able to "build" these features using previous hidden layers that detect low-level features, such as lines.
Feature engineering is difficult to perform on raw image data.
K-means Clustering is an unsupervised learning method used to categorize unlabeled data.
Reference: https://deeplearning4j.org/neuralnet-overview
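
To make the "multiple hidden layers" idea concrete, here is a minimal sketch in TensorFlow/Keras, assuming 64x64 grayscale images labeled face / no-face; the input shape and layer sizes are illustrative assumptions, not part of the question.

# Minimal sketch of a deep network for the face / no-face task.
# Assumptions: 64x64 grayscale inputs, binary labels.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Flatten(),
    # Several hidden layers let the network build up features
    # (edges -> shapes -> facial features) automatically.
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of "contains a face"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()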

NEW QUESTION 3

You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard functions to generate daily and monthly reports for all time ranges. Recently, you discovered that some queries that cover long date ranges are exceeding the limit of 1,000 tables and failing. How can you resolve this issue?

  • A. Convert all daily log tables into date-partitioned tables
  • B. Convert the sharded tables into a single partitioned table
  • C. Enable query caching so you can cache data from previous months
  • D. Create separate views to cover each month, and query from these views

Answer: B
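
As an illustration of answer B, the sketch below uses the BigQuery Python client to read every LOGS_yyyymmdd shard through a wildcard and write the result into one date-partitioned table. The project, dataset, and destination table names are placeholders, not values from the question.

# Consolidate sharded LOGS_yyyymmdd tables into a single partitioned table
# so long date ranges no longer hit the 1,000-table limit.
from google.cloud import bigquery

client = bigquery.Client()

destination = bigquery.TableReference.from_string("my-project.logs.logs_partitioned")
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(field="log_date"),
)

sql = """
SELECT *, PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS log_date
FROM `my-project.logs.LOGS_*`
"""
client.query(sql, job_config=job_config).result()  # wait for the job to finish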

NEW QUESTION 4

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

  • A. Store the common data in BigQuery as partitioned tables.
  • B. Store the common data in BigQuery and expose authorized views.
  • C. Store the common data encoded as Avro in Google Cloud Storage.
  • D. Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Answer: C
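
As a sketch of answer C, the snippet below loads hypothetical Avro files from a shared Cloud Storage bucket into BigQuery with the Python client; the same files stay directly readable by Hadoop and Spark on Cloud Dataproc. Bucket and table names are assumptions.

# Load shared Avro files from Cloud Storage into BigQuery for analysis.
from google.cloud import bigquery

client = bigquery.Client()
table_ref = bigquery.TableReference.from_string("my-project.analytics.common_data")

load_job = client.load_table_from_uri(
    "gs://flowlogistic-shared-data/common/*.avro",  # hypothetical bucket path
    table_ref,
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
)
load_job.result()  # wait for the load to complete
print(client.get_table(table_ref).num_rows, "rows available in BigQuery")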

NEW QUESTION 5

You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30–90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries. What should you do?

  • A. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE type.
  • B. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
  • C. Modify your pipeline to maintain the last 30–90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
  • D. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.

Answer: A
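
A sketch of answer A, assuming a hypothetical events table with an event_timestamp column: the table is re-created as a date-partitioned table with DDL, after which a 30–90 day filter scans only the matching partitions instead of the whole table.

# Re-create the table as a partitioned table, then run a pruned query.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE OR REPLACE TABLE `my-project.analytics.events_partitioned`
PARTITION BY DATE(event_timestamp) AS
SELECT * FROM `my-project.analytics.events`
""").result()

rows = client.query("""
SELECT COUNT(*) AS n
FROM `my-project.analytics.events_partitioned`
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
""").result()
print(next(iter(rows)).n)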

NEW QUESTION 6

You are designing a cloud-native historical data processing system to meet the following conditions:
  • The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
  • A streaming data pipeline stores new data daily.
  • Performance is not a factor in the solution.
  • The solution design should maximize availability.
How should you design data storage for this solution?

  • A. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
  • B. Store the data in BigQuery. Access the data using the BigQuery Connector on Cloud Dataproc and Compute Engine.
  • C. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
  • D. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.

Answer: D

NEW QUESTION 7

You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?

  • A. Change the processing job to use Google Cloud Dataproc instead.
  • B. Manually start the Cloud Dataflow job each morning when you get into the office.
  • C. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
  • D. Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Answer: C

NEW QUESTION 8

You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users’ privacy?

  • A. Grant the consultant the Viewer role on the project.
  • B. Grant the consultant the Cloud Dataflow Developer role on the project.
  • C. Create a service account and allow the consultant to log on with it.
  • D. Create an anonymized sample of the data for the consultant to work with in a different project.

Answer: B

NEW QUESTION 9

You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?

  • A. Cloud Scheduler
  • B. Cloud Dataflow
  • C. Cloud Functions
  • D. Cloud Composer

Answer: D

NEW QUESTION 10

You are building new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?

  • A. Include ORDER BY DESC on the timestamp column and LIMIT to 1.
  • B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
  • C. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
  • D. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.

Answer: D
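
A sketch of the ROW_NUMBER approach in answer D; unique_id, event_timestamp, and the table name are assumed names, not values from the question.

# Keep only the latest row per unique ID when querying interactively.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY unique_id ORDER BY event_timestamp DESC) AS row_num
  FROM `my-project.warehouse.streamed_events`
)
WHERE row_num = 1
"""
for row in client.query(sql).result():
    print(dict(row))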

NEW QUESTION 11

Your team is working on a binary classification problem. You have trained a support vector machine (SVM) classifier with default parameters, and received an area under the Curve (AUC) of 0.87 on the validation set. You want to increase the AUC of the model. What should you do?

  • A. Perform hyperparameter tuning
  • B. Train a classifier with deep neural networks, because neural networks would always beat SVMs
  • C. Deploy the model and measure the real-world AUC; it’s always higher because of generalization
  • D. Scale predictions you get out of the model (tune a scaling factor as a hyperparameter) in order to get the highest AUC

Answer: A

NEW QUESTION 12

Cloud Bigtable is a recommended option for storing very large amounts of _____ ?

  • A. multi-keyed data with very high latency
  • B. multi-keyed data with very low latency
  • C. single-keyed data with very low latency
  • D. single-keyed data with very high latency

Answer: C

Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key. Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.
Reference: https://cloud.google.com/bigtable/docs/overview
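
To illustrate single-keyed, low-latency access, here is a sketch of a point lookup with the Cloud Bigtable Python client; the project, instance, table, and row key are placeholders.

# Point lookup of one row by its single indexed row key.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("metrics")

row = table.read_row(b"sensor#1234#20240101")  # the row key is the only index
if row is not None:
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            print(family, qualifier.decode(), cells[0].value)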

NEW QUESTION 13

You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You’ve collected a labeled dataset that has on average 1000 examples for each unique component. Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days. What should you do?

  • A. Use Cloud Vision AutoML with the existing dataset.
  • B. Use Cloud Vision AutoML, but reduce your dataset twice.
  • C. Use Cloud Vision API by providing custom labels as recognition hints.
  • D. Train your own image recognition model leveraging transfer learning techniques.

Answer: A

NEW QUESTION 14

Which methods can be used to reduce the number of rows processed by BigQuery?

  • A. Splitting tables into multiple tables; putting data in partitions
  • B. Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
  • C. Putting data in partitions; using the LIMIT clause
  • D. Splitting tables into multiple tables; using the LIMIT clause

Answer: A

Explanation:
If you split a table into multiple tables (such as one table for each day), then you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, as long as your data can be separated by the day.
If you use the LIMIT clause, BigQuery will still process the entire table.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
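
The point is easy to verify with a dry run, which reports bytes processed without running the query; the table and column names below are assumptions.

# Compare bytes processed: LIMIT alone vs. a partition filter.
from google.cloud import bigquery

client = bigquery.Client()
dry = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

limited = client.query(
    "SELECT * FROM `my-project.logs.events_partitioned` LIMIT 10",
    job_config=dry,
)
pruned = client.query(
    """SELECT * FROM `my-project.logs.events_partitioned`
       WHERE DATE(event_timestamp) = '2024-01-01'""",
    job_config=dry,
)

print("LIMIT only:       ", limited.total_bytes_processed)
print("Partition filter: ", pruned.total_bytes_processed)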

NEW QUESTION 15

Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data. Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?

  • A. Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user.
  • B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.
  • C. In Cloud SQL, with separate database user names for each user. The Cloud SQL Admin activity logs will be used to provide the auditability.
  • D. In a bucket on Cloud Storage that is accessible only by an App Engine service that collects user information and logs the access before providing a link to the bucket.

Answer: B

NEW QUESTION 16

Which of the following statements about Legacy SQL and Standard SQL is not true?

  • A. Standard SQL is the preferred query language for BigQuery.
  • B. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
  • C. One difference between the two query languages is how you specify fully-qualified table names (i.e., table names that include their associated project name).
  • D. You need to set a query language for each dataset and the default is Standard SQL.

Answer: D

Explanation:
You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released.
In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
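
The table-name difference is easy to see side by side; this sketch runs the same query against a public sample table in both dialects using the BigQuery Python client.

# Legacy SQL uses [project:dataset.table]; standard SQL uses `project.dataset.table`.
from google.cloud import bigquery

client = bigquery.Client()

legacy = client.query(
    "SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 5",
    job_config=bigquery.QueryJobConfig(use_legacy_sql=True),
)
standard = client.query(
    "SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 5",
    job_config=bigquery.QueryJobConfig(use_legacy_sql=False),  # the default
)

print([row.word for row in legacy.result()])
print([row.word for row in standard.result()])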

NEW QUESTION 17

Which of the following is not possible using primitive roles?

  • A. Give a user viewer access to BigQuery and owner access to Google Compute Engine instances.
  • B. Give UserA owner access and UserB editor access for all datasets in a project.
  • C. Give a user access to view all datasets in a project, but not run queries on them.
  • D. Give GroupA owner access and GroupB editor access for all datasets in a project.

Answer: C

Explanation:
Primitive roles can be used to give owner, editor, or viewer access to a user or group, but they can't be used to separate data access permissions from job-running permissions.
Reference: https://cloud.google.com/bigquery/docs/access-control#primitive_iam_roles

NEW QUESTION 18

You have an Apache Kafka Cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.
What should you do?

  • A. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
  • B. Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
  • C. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS.
  • D. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.

Answer: A

NEW QUESTION 19

Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Bigtable cluster (select 2 answers)?

  • A. A sequential numeric ID
  • B. A timestamp followed by a stock symbol
  • C. A non-sequential numeric ID
  • D. A stock symbol followed by a timestamp

Answer: AB

Explanation:
Using a timestamp as the first element of a row key can cause a variety of problems.
In brief, when a row key for a time series includes a timestamp, all of your writes will target a single node; fill that node; and then move onto the next node in the cluster, resulting in hotspotting.
Suppose your system assigns a numeric ID to each of your application's users. You might be tempted to use the user's numeric ID as the row key for your table. However, since new users are more likely to be active users, this approach is likely to push most of your traffic to a small number of nodes. [https://cloud.google.com/bigtable/docs/schema-design]
Reference:
https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotti
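
A sketch of the recommended key design with the Cloud Bigtable Python client: the row key leads with the stock symbol rather than a timestamp or sequential ID, so writes spread across nodes. The project, instance, table, and column family are placeholders assumed to already exist.

# Field-promoted row key: symbol first, timestamp second.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("market-data")

symbol = "GOOG"
ts = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
row_key = f"{symbol}#{ts}".encode()      # good: distributes load across nodes
# bad_key = f"{ts}#{symbol}".encode()    # hotspots: every write targets one node

row = table.direct_row(row_key)
row.set_cell("quotes", b"price", b"181.42")  # assumes a "quotes" column family
row.commit()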

NEW QUESTION 20

You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?

  • A. Create a Cloud Dataproc Workflow Template
  • B. Create an initialization action to execute the jobs
  • C. Create a Directed Acyclic Graph in Cloud Composer
  • D. Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster

Answer: A

NEW QUESTION 21

You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters are running on Compute Engine instances. You need to encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed. What should you do?

  • A. Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls.
  • B. Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
  • C. Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
  • D. Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances.

Answer: B

NEW QUESTION 22

Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.
SELECT person FROM `project1.example.table1` WHERE city = "London"
How would you correct the error?

  • A. Add ", UNNEST(person)" before the WHERE clause.
  • B. Change "person" to "person.city".
  • C. Change "person" to "city.person".
  • D. Add ", UNNEST(city)" before the WHERE clause.

Answer: A

Explanation:
To access the person.city column, you need to "UNNEST(person)" and JOIN it to table1 using a comma. Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#nested_repeated_resu
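
For reference, the corrected query from answer A, wrapped in the Python client for illustration; the table is the hypothetical one named in the question.

# The nested person.city field becomes filterable after UNNEST(person).
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT person
FROM `project1.example.table1`, UNNEST(person)
WHERE city = "London"
"""
for row in client.query(sql).result():
    print(row.person)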

NEW QUESTION 23

You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do?

  • A. Use bq load to load a batch of sensor data every 60 seconds.
  • B. Use a Cloud Dataflow pipeline to stream data into the BigQuery table.
  • C. Use the INSERT statement to insert a batch of data every 60 seconds.
  • D. Use the MERGE statement to apply updates in batch every 60 seconds.

Answer: B
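
A sketch of answer B with the Apache Beam Python SDK: sensor readings arriving on a Pub/Sub topic are streamed into BigQuery, so rows become queryable well within a minute. The topic, table, and schema are assumptions, and runner/project options are omitted.

# Streaming pipeline: Pub/Sub -> parse JSON -> stream into BigQuery.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadSensorData" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/sensor-readings")
        | "Parse" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:sensors.readings",
            schema="sensor_id:STRING,reading:FLOAT,event_time:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )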

NEW QUESTION 24
......

Thanks for reading the newest Professional-Data-Engineer exam dumps! We recommend you to try the PREMIUM Dumpscollection.com Professional-Data-Engineer dumps in VCE and PDF here: https://www.dumpscollection.net/dumps/Professional-Data-Engineer/ (239 Q&As Dumps)