[Aug 22, 2024] Free Databricks Certification Databricks-Certified-Professional-Data-Engineer Exam Question [Q21-Q40]

Databricks-Certified-Professional-Data-Engineer dumps & Databricks Certification sure practice dumps

The Databricks Certified Professional Data Engineer exam evaluates a candidate's ability to design and implement data pipelines, work with data sources and sinks, and perform transformations using Databricks. It also tests the candidate's ability to optimize and tune data pipelines for performance and reliability.

Achieving the Databricks Certified Professional Data Engineer certification is a valuable asset for data professionals. It demonstrates that the individual has the knowledge and skills to work with big data and cloud computing technologies, specifically the Databricks Unified Analytics Platform. The certification is recognized by leading organizations and can help individuals advance their careers in data engineering. It also provides a competitive advantage over other candidates when applying for jobs in big data and cloud computing.

By passing the exam, data engineers can demonstrate their proficiency in using the Databricks platform to build scalable and reliable data pipelines, which can help them advance their careers and increase their earning potential.

NEW QUESTION # 21
What is the top-level object in Unity Catalog?

A. Catalog
B. Workspace
C. Table
D. Metastore
E. Database

Answer: D

Explanation:
Key concepts - Azure Databricks | Microsoft Docs

NEW QUESTION # 22
An external object storage container has been mounted to the location /mnt/finance_eda_bucket.
The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

A. A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.
B. An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.
C. A managed table will be created in the DBFS root storage container.
D. A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.
E. A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.

Answer: E

Explanation:
https://docs.databricks.com/en/lakehouse/data-objects.html

NEW QUESTION # 23
The data engineering team uses a set of SQL queries to review data quality and monitor the ETL job every day. Which of the following approaches can be used to set up a schedule and automate this process?

A. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.
B. They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL
C. They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.
D. They can schedule the query to run every 1 day from the Jobs UI
E. They can schedule the query to run every 12 hours from the Jobs UI.

Answer: A

Explanation:
Individual queries can be refreshed on a schedule.
To set the schedule:
1. Click the query info tab.
2. Click the link to the right of Refresh Schedule to open a picker with schedule intervals.
3. Set the schedule. The picker scrolls and allows you to choose:
* An interval: 1-30 minutes, 1-12 hours, 1 or 30 days, 1 or 2 weeks
* A time. The time selector displays in the picker only when the interval is greater than 1 day and the day selection is greater than 1 week. When you schedule a specific time, Databricks SQL takes input in your computer's timezone and converts it to UTC. If you want a query to run at a certain time in UTC, you must adjust the picker by your local offset. For example, if you want a query to execute at 00:00 UTC each day, but your current timezone is PDT (UTC-7), you should select 17:00 in the picker.
4. Click OK.
Your query will run automatically.
If you experience a scheduled query not executing according to its schedule, you should manually trigger the query to make sure it doesn't fail. However, you should be aware of the following:
* If you schedule an interval (for example, "every 15 minutes"), the interval is calculated from the last successful execution. If you manually execute a query, the scheduled query will not be executed until the interval has passed.
* If you schedule a time, Databricks SQL waits for the results to be "outdated". For example, if you have a query set to refresh every Thursday and you manually execute it on Wednesday, by Thursday the results will still be considered "valid", so the query wouldn't be scheduled for a new execution. Thus, for example, when setting a weekly schedule, check the last query execution time and expect the scheduled query to be executed on the selected day after that execution is a week old. Make sure not to manually execute the query during this time.
If a query execution fails, Databricks SQL retries with a back-off algorithm. The more failures, the further away the next retry will be (and it might be beyond the refresh interval).
Refer to the documentation for additional information:
https://docs.microsoft.com/en-us/azure/databricks/sql/user/queries/schedule-query
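
The local-offset adjustment described above can be checked with a few lines of Python. This is a minimal sketch, not part of Databricks SQL: the function name picker_time_for_utc is hypothetical, and PDT is modeled as a fixed UTC-7 offset, so daylight-saving transitions are ignored.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: PDT modeled as a fixed UTC-7 offset (no daylight-saving handling).
PDT = timezone(timedelta(hours=-7), name="PDT")


def picker_time_for_utc(target_utc: time, local_zone: timezone) -> time:
    """Return the local wall-clock time to select so the query runs at target_utc."""
    # The calendar date is arbitrary; only the time of day matters here.
    as_utc = datetime(2024, 8, 22, target_utc.hour, target_utc.minute,
                      tzinfo=timezone.utc)
    return as_utc.astimezone(local_zone).time()


print(picker_time_for_utc(time(0, 0), PDT))  # 17:00:00
```

The result matches the example in the explanation: to run at 00:00 UTC from a PDT machine, select 17:00 in the picker.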

NEW QUESTION # 24
What is the main difference between the silver layer and the gold layer in medallion architecture?

A. Silver may contain aggregated data
B. Gold is a copy of silver data
C. Gold may contain aggregated data
D. Data quality checks are applied in gold
E. Silver is a copy of bronze data

Answer: C

Explanation:
Medallion Architecture - Databricks
Exam focus: Understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.

NEW QUESTION # 25
The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:
store_id INT, sales_date DATE, total_sales FLOAT
If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

A. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.
B. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
C. Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
D. Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
E. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

Answer: D

Explanation:
The daily_store_sales table contains all the information needed to update store_sales_summary. The schema of the table is:
store_id INT, sales_date DATE, total_sales FLOAT
The daily_store_sales table is implemented as a Type 1 table, which means that old values are overwritten by new values and no history is maintained. The total_sales column might be adjusted after manual data auditing, which means that the data in the table may change over time.
The safest approach to generate accurate reports in the store_sales_summary table is to use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update. Structured Streaming is a scalable and fault-tolerant stream processing engine built on Spark SQL. Structured Streaming allows processing data streams as if they were tables or DataFrames, using familiar operations such as select, filter, groupBy, or join. Structured Streaming also supports output modes that specify how to write the results of a streaming query to a sink, such as append, update, or complete. Structured Streaming can handle both streaming and batch data sources in a unified manner.
The change data feed is a Delta Lake feature that records row-level changes (inserts, updates, and deletes) made to a table and exposes them as a source that batch or Structured Streaming queries can read. Each change record carries the changed row along with metadata such as the change type and the table version at which the change was committed. A reader can start from a specific table version or timestamp.
By using Structured Streaming to subscribe to the change data feed for daily_store_sales, one can capture and process any changes made to the total_sales column due to manual data auditing. By applying these changes to the aggregates in the store_sales_summary table with each update, one can ensure that the reports are always consistent and accurate with the latest data. Verified References: [Databricks Certified Data Engineer Professional], under "Spark Core" section; Databricks Documentation, under "Structured Streaming" section; Databricks Documentation, under "Delta Change Data Feed" section.
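
To make the idea of applying changes to an aggregate concrete, here is a plain-Python sketch that needs no Spark cluster. It is an illustration under stated assumptions, not the pipeline from the question: the function apply_changes and the sample rows are hypothetical, and the only detail borrowed from Delta Lake is the set of _change_type values (insert, update_preimage, update_postimage, delete) that the change data feed attaches to each change record.

```python
from collections import defaultdict


def apply_changes(totals, changes):
    """Apply change records, in commit order, to per-store running totals."""
    for change in changes:
        kind = change["_change_type"]
        if kind in ("insert", "update_postimage"):
            # New or corrected value: add it to the running total.
            totals[change["store_id"]] += change["total_sales"]
        elif kind in ("delete", "update_preimage"):
            # Removed or superseded value: back it out of the running total.
            totals[change["store_id"]] -= change["total_sales"]
        else:
            raise ValueError(f"unexpected change type: {kind}")
    return totals


totals = defaultdict(float)

# Two nightly loads for store 1 (hypothetical sample data).
apply_changes(totals, [
    {"_change_type": "insert", "store_id": 1, "total_sales": 100.0},
    {"_change_type": "insert", "store_id": 1, "total_sales": 50.0},
])

# A manual audit later corrects the first day's figure from 100.0 to 120.0.
apply_changes(totals, [
    {"_change_type": "update_preimage", "store_id": 1, "total_sales": 100.0},
    {"_change_type": "update_postimage", "store_id": 1, "total_sales": 120.0},
])

print(totals[1])  # 170.0
```

Because a Type 1 table overwrites the old value in place, the preimage record is the only place the superseded figure is still visible, which is why the sketch subtracts it before adding the postimage.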

NEW QUESTION # 26
You are asked to set up AUTO LOADER to process incoming data. The data arrives in JSON format and is dropped into cloud object storage, and you are required to process it as soon as it arrives. Which of the following statements is correct?

A. AUTO LOADER needs to be converted to a Structured stream process
B. AUTO LOADER has to be triggered from an external process when the file arrives in the cloud storage
C. AUTO LOADER can only process continuous data when stored in DELTA lake
D. AUTO LOADER can support file notification method so it can process data as it arrives
E. AUTO LOADER is native to DELTA lake it cannot support external cloud object storage

Answer: D

Explanation:
Auto Loader supports two modes when ingesting new files from cloud object storage:
Directory listing: Auto Loader identifies new files by listing the input directory, and uses a directory polling approach.
File notification: Auto Loader can automatically set up a notification service and queue service that subscribe to file events from the input directory.

File notification is more efficient and can be used to process the data in real time as data arrives in cloud object storage.
Choosing between file notification and directory listing modes | Databricks on AWS
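
As a rough illustration of how the two modes are selected, the sketch below builds the reader options for an Auto Loader stream. The function names and parameters are hypothetical; the option keys cloudFiles.format, cloudFiles.useNotifications, and cloudFiles.schemaLocation come from the Databricks Auto Loader documentation rather than from this page, and read_json_stream assumes it is handed an active SparkSession on Databricks.

```python
def auto_loader_options(use_notifications: bool) -> dict:
    """Build Auto Loader reader options for JSON input (hypothetical helper)."""
    return {
        "cloudFiles.format": "json",
        # "true" selects file notification mode; "false" keeps directory listing.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def read_json_stream(spark, input_path: str, schema_location: str):
    """Return a streaming DataFrame over input_path using file notification mode.

    Assumes `spark` is an active SparkSession on a Databricks cluster.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(use_notifications=True))
        .option("cloudFiles.schemaLocation", schema_location)
        .load(input_path)
    )


print(auto_loader_options(True))
```

Either way the result is an ordinary Structured Streaming source, which is why option A (converting Auto Loader to a Structured Streaming process) is not needed.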

NEW QUESTION # 27
A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day.
A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.
Which limitation will the team face while diagnosing this problem?

A. New fields cannot be computed for historic records.
B. Spark cannot capture the topic and partition fields from the Kafka source.
C. Updating the table schema will invalidate the Delta transaction log metadata.
D. Updating the table schema requires a default value provided for each file added.

Answer: A

Explanation:
When adding new fields to a Delta table's schema, these fields will not be retrospectively applied to historical records that were ingested before the schema change. Consequently, while the team can use the new metadata fields to investigate transient processing delays moving forward, they will be unable to apply this diagnostic approach to past data that lacks these fields.
References:
* Databricks documentation on Delta Lake schema management:
https://docs.databricks.com/delta/delta-batch.html#schema-management

NEW QUESTION # 28
You noticed a colleague manually copying data to a backup folder before running an update command, so that the backup copy can replace the table if the update does not produce the expected outcome. Which Delta Lake feature would you recommend to simplify the process?

A. Use SHADOW copy of the table as preferred backup choice
B. Cloud object storage retains previous version of the file
C. Use DEEP CLONE to clone the table prior to update to make a backup copy
D. Use time travel feature to refer old data instead of manually copying
E. Cloud object storage automatically backups the data

Answer: D

Explanation:
The answer is: Use time travel feature to refer old data instead of manually copying.
https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01 01:30:00.000"

NEW QUESTION # 29
Which of the following describes a benefit of a data lakehouse that is unavailable in a traditional data warehouse?

A. A data lakehouse enables both batch and streaming analytics
B. A data lakehouse utilizes proprietary storage formats for data
C. A data lakehouse captures snapshots of data for version control purposes
D. A data lakehouse provides a relational system of data management
E. A data lakehouse couples storage and compute for complete control

Answer: A

NEW QUESTION # 30
Which one of the following is not a Databricks lakehouse object?

A. Catalog
B. Views
C. Database/Schemas
D. Tables
E. Stored Procedures
F. Functions

Answer: E

Explanation:
The answer is, Stored Procedures.
Databricks lakehouse does not support stored procedures.

NEW QUESTION # 31
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?

A. Three columns will be returned, but one column will be named "redacted" and contain only null values.
B. The email and ltv columns will be returned with the values in user_ltv.
C. Only the email and ltv columns will be returned; the email column will contain all null values.
D. The email, age, and ltv columns will be returned with the values in user_ltv.
E. Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

Answer: E

Explanation:
The code creates a view called email_ltv that selects the email and ltv columns from a table called user_ltv, which has the following schema: email STRING, age INT, ltv INT. The code also uses the CASE WHEN expression to replace the email values with the string "REDACTED" if the user is not a member of the marketing group. The user who executes the query is not a member of the marketing group, so they will only see the email and ltv columns, and the email column will contain the string "REDACTED" in each row.
Verified References: [Databricks Certified Data Engineer Professional], under "Lakehouse" section; Databricks Documentation, under "CASE expression" section.
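
The view definition itself is not reproduced on this page, so the following is only a plain-Python simulation of the behavior the explanation describes. The function email_ltv_view and the sample row are hypothetical, and the boolean parameter stands in for whatever group-membership check the real view performs.

```python
def email_ltv_view(rows, is_marketing_member: bool):
    """Simulate the view: project email and ltv, redacting email for non-members."""
    return [
        {
            # Mirrors the CASE WHEN: real value for members, a constant otherwise.
            "email": row["email"] if is_marketing_member else "REDACTED",
            "ltv": row["ltv"],
        }
        for row in rows
    ]


# Hypothetical sample row with the user_ltv schema (email, age, ltv).
user_ltv = [{"email": "a@example.com", "age": 31, "ltv": 420}]

print(email_ltv_view(user_ltv, is_marketing_member=False))
# [{'email': 'REDACTED', 'ltv': 420}]
```

Note that age never appears in the output for any caller, because the view does not select it; membership only changes what the email column contains.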

NEW QUESTION # 32
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this task.

A. await("event_time + '10 minutes'")
B. withWatermark("event_time", "10 minutes")
C. awaitArrival("event_time", "10 minutes")
D. slidingWindow("event_time", "10 minutes")
E. delayWrite("event_time", "10 minutes")

Answer: B

Explanation:
The correct answer is B. withWatermark("event_time", "10 minutes"). This is because the question asks for incremental state information to be maintained for 10 minutes for late-arriving data. The withWatermark method is used to define the watermark for late data. The watermark is a timestamp column and a threshold that tells the system how long to wait for late data. In this case, the watermark is set to 10 minutes. The other options are incorrect because they are not valid methods or syntax for watermarking in Structured Streaming. References:
* Watermarking: https://docs.databricks.com/spark/latest/structured-streaming/watermarks.html
* Windowed aggregations:
https://docs.databricks.com/spark/latest/structured-streaming/window-operations.html
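
The interaction between five-minute windows and a 10-minute watermark can be simulated in plain Python. This is a deliberately simplified model, not Spark's implementation: it advances the watermark after every event rather than once per micro-batch, it averages only temperature, and the names window_start and average_temp are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # non-overlapping (tumbling) window length
DELAY = timedelta(minutes=10)   # watermark threshold for late data


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its five-minute window."""
    return ts - timedelta(minutes=ts.minute % 5, seconds=ts.second,
                          microseconds=ts.microsecond)


def average_temp(events):
    """Average temperature per window, dropping events behind the watermark."""
    state = {}       # window start -> [sum of temps, count]
    max_seen = None  # latest event time observed so far
    for ts, temp in events:
        start = window_start(ts)
        # Watermark = latest event time seen so far minus the allowed delay.
        if max_seen is not None and start + WINDOW <= max_seen - DELAY:
            continue  # the window closed before the watermark: too late
        bucket = state.setdefault(start, [0.0, 0])
        bucket[0] += temp
        bucket[1] += 1
        max_seen = ts if max_seen is None else max(max_seen, ts)
    return {start: total / count for start, (total, count) in state.items()}


day = datetime(2024, 8, 22)
events = [
    (day.replace(hour=12, minute=1), 20.0),
    (day.replace(hour=12, minute=3), 22.0),
    (day.replace(hour=12, minute=21), 30.0),   # watermark moves to 12:11
    (day.replace(hour=12, minute=2), 100.0),   # window ended 12:05: dropped
    (day.replace(hour=12, minute=14), 10.0),   # window ends 12:15: still kept
]
result = average_temp(events)
print(result[day.replace(hour=12, minute=0)])  # 21.0
```

The late 12:02 reading is discarded because its window closed more than 10 minutes behind the latest event, while the 12:14 reading is late but still inside the threshold, so it is aggregated.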

NEW QUESTION # 33
A data engineer wants to create a relational object by pulling data from two tables. The relational object must be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?

A. View
B. Delta Table
C. Database
D. Spark SQL Table
E. Temporary view

Answer: A

NEW QUESTION # 34
A team member who currently owns a few tables is leaving the team. Instead of transferring ownership to a single user, you have decided to transfer it to a group so that in the future anyone in the group can manage the permissions. Which of the following commands helps you accomplish this?

A. TRANSFER OWNER table_name to 'group'
B. ALTER TABLE table_name OWNER to 'group'
C. ALTER OWNER ON table_name to 'group'
D. GRANT OWNER On table_name to 'group'
E. GRANT OWNER table_name to 'group'

Answer: B

Explanation:
The answer is ALTER TABLE table_name OWNER to 'group'
Assign owner to object

NEW QUESTION # 35
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

A. workspace
B. configure
C. libraries
D. fs
E. jobs

Answer: D

Explanation:
The fs command group interacts with DBFS from the command line, so it is the one that uploads files. Copying a local wheel file to a dbfs:/ path with databricks fs cp places it in the object storage mounted with DBFS, where a production job can reference it.
The libraries command group does not upload files. It installs, uninstalls, and lists libraries on Databricks clusters, and its install command expects the wheel to already be at the path given with the --whl option. For example, the following command installs a previously uploaded wheel named mylib-0.1-py3-none-any.whl on a cluster with the id 1234-567890-abcde123:
databricks libraries install --cluster-id 1234-567890-abcde123 --whl dbfs:/mnt/mylib/mylib-0.1-py3-none-any.whl
You can also use the libraries uninstall command to uninstall a library from a cluster, and the libraries list command to list the libraries installed on a cluster.
References:
Libraries CLI (legacy): https://docs.databricks.com/en/archive/dev-tools/cli/libraries-cli.html
Library operations: https://docs.databricks.com/en/dev-tools/cli/commands.html#library-operations
Install or update the Databricks CLI: https://docs.databricks.com/en/dev-tools/cli/install.html

NEW QUESTION # 36
What is the purpose of the bronze layer in a Multi-hop architecture?

A. Contains aggregated data that is to be consumed into Silver
B. Provides efficient storage and querying of full unprocessed history of data
C. Can be used to eliminate duplicate records
D. Used as a data source for Machine learning applications.
E. Perform data quality checks, corrupt data quarantined

Answer: B

Explanation:
The answer is: Provides efficient storage and querying of full unprocessed history of data.
Medallion Architecture - Databricks
Bronze Layer:
1. Raw copy of ingested data
2. Replaces traditional data lake
3. Provides efficient storage and querying of full, unprocessed history of data
4. No schema is applied at this layer
Exam focus: Understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.

NEW QUESTION # 37
A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.
Which code block attempts to perform an invalid stream-static join?

A. streamingDF.join(userLookup, ["user_id"], how="left")
B. streamingDF.join(userLookup, ["userid"], how="inner")
C. userLookup.join(streamingDF, ["user_id"], how="right")
D. streamingDF.join(userLookup, ["user_id"], how="outer")
E. userLookup.join(streamingDF, ["userid"], how="inner")

Answer: D

Explanation:
In Spark Structured Streaming, only some join types are supported between a streaming DataFrame and a static DataFrame. An outer join is valid only when the streaming DataFrame is on the preserved side: a left outer join with the stream on the left, or a right outer join with the stream on the right (as in option C). A full outer join must also preserve every row of the static DataFrame, including rows that never find a match, and Spark cannot decide that a static row is unmatched while new streaming rows may still arrive. The full outer join in option D is therefore invalid. Inner joins are supported in either order.
References:
* Structured Streaming Programming Guide: Join Operations
* Databricks Documentation on Stream-Static Joins: Databricks Stream-Static Joins
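
The support rules can be written down as a small lookup table. This sketch encodes the inner and outer rows of the join support matrix in the Structured Streaming Programming Guide as I understand it; the names SUPPORTED and is_supported are hypothetical, and semi joins are left out because the question does not use them.

```python
# Join types Spark accepts for each (left side, right side) pairing.
SUPPORTED = {
    ("stream", "static"): {"inner", "left"},
    ("static", "stream"): {"inner", "right"},
}

# Normalize the spellings that DataFrame.join accepts for the `how` argument.
ALIASES = {
    "outer": "full", "full_outer": "full", "fullouter": "full",
    "left_outer": "left", "leftouter": "left",
    "right_outer": "right", "rightouter": "right",
}


def is_supported(left: str, right: str, how: str) -> bool:
    """Return True if a join of this shape is valid between a stream and a static table."""
    return ALIASES.get(how, how) in SUPPORTED[(left, right)]


# The five options from the question, in order A to E.
options = [
    ("stream", "static", "left"),
    ("stream", "static", "inner"),
    ("static", "stream", "right"),
    ("stream", "static", "outer"),
    ("static", "stream", "inner"),
]
print([is_supported(*option) for option in options])
# [True, True, True, False, True]
```

In every supported outer join the stream sits on the preserved side, which is the pattern worth remembering for the exam.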

NEW QUESTION # 38
A particular job seems to be performing slower and slower over time, and the team thinks this started when a recent production change was implemented. You were asked to look at the job history to identify trends and the root cause. Where in the workspace UI can you perform this analysis?

A. Under Compute UI, select Job cluster and select the job cluster to see last 60 day historical runs
B. Under jobs UI select the job you are interested, under runs we can see current active runs and last 60 days historical run
C. Historical job runs can only be accessed by REST API
D. Under jobs UI select the job cluster, under spark UI select the application job logs, then you can access last 60 day historical runs
E. Under Workspace logs, select job logs and select the job you want to monitor to view the last 60 day historical runs

Answer: B

Explanation:
The answer is: Under jobs UI select the job you are interested, under runs we can see current active runs and last 60 days historical run

NEW QUESTION # 39
In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

A. Data Preparation
B. Communicate Results
C. Discovery
D. Model Building

Answer: A

NEW QUESTION # 40
......

Databricks Databricks-Certified-Professional-Data-Engineer Actual Questions and Braindumps: https://braindumps.free4torrent.com/Databricks-Certified-Professional-Data-Engineer-valid-dumps-torrent.html
