Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages

reference:   https://stackoverflow.com/questions/46001583/why-does-spark-submit-fail-to-find-kafka-data-source-unless-packages-is-used

Experiment succeeded:

spark-submit.cmd --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 f:\structuredstreaming-test.jar

 

Spark Structured Streaming supports Apache Kafka as the streaming source and sink using the external kafka-0-10-sql module.

The kafka-0-10-sql module is not available to Spark applications submitted with spark-submit by default. The module is external, and to have it available you have to define it as a dependency.

Unless you use code specific to the kafka-0-10-sql module in your Spark application, you don't have to define the module as a dependency in pom.xml. You simply don't need a compilation dependency on the module, since no code references the module's classes. You code against interfaces, which is one of the reasons why Spark SQL is so pleasant to use (i.e. it requires very little code to build a fairly sophisticated distributed application).
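For illustration, here is a minimal sketch of such an application (the object name, topic name and bootstrap servers are placeholders, not from the original post). Note that "kafka" is just a string passed to format; the data source implementation is looked up at runtime, which is why no compile-time dependency on the module is needed:

import org.apache.spark.sql.SparkSession

object StructuredStreamingTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("structuredstreaming-test").getOrCreate()

    // "kafka" is resolved at runtime by Spark's data source lookup;
    // no class from spark-sql-kafka-0-10 is referenced here directly.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "test-topic")                    // placeholder
      .load()

    // Write the raw Kafka records to the console, just to have a running query.
    val query = df.writeStream
      .format("console")
      .start()
    query.awaitTermination()
  }
}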

spark-submit, however, does require the --packages command-line option, which you've reported worked fine.

However, when I run my code as mentioned here, i.e. with the package option:

--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0

The reason it worked fine with --packages is that you have to tell the Spark infrastructure where to find the definition of the kafka format.

That leads us to the other "issue" (or rather a requirement) for running streaming Spark applications with Kafka: you have to specify the runtime dependency on the spark-sql-kafka module.

You specify a runtime dependency either with the --packages command-line option (which downloads the necessary jars when you spark-submit your Spark application) or by building a so-called uber-jar (or fat-jar).
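For instance, the two options look roughly like this (reusing the jar from the experiment above; the path and version are only illustrative):

spark-submit.cmd --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 f:\structuredstreaming-test.jar

spark-submit.cmd f:\structuredstreaming-test.jar        (only after the Kafka module has been bundled into the uber-jar)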

That's where pom.xml comes into play (and that's why people offered their help with pom.xml and the module as a dependency).

So, first of all, you have to specify the dependency in pom.xml.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>
</dependency>

And last but not least, you have to build an uber-jar, which you configure in pom.xml using the Apache Maven Shade Plugin.

With the Apache Maven Shade Plugin you create an uber-jar that includes all the "infrastructure" for the kafka format to work inside the Spark application jar file. In fact, the uber-jar contains all the necessary runtime dependencies, so you can spark-submit the jar alone (with no --packages option or similar).
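A minimal sketch of the shade plugin configuration in pom.xml might look like this (the plugin version is illustrative). The ServicesResourceTransformer is worth pointing out: it merges META-INF/services files, so the entry that registers the kafka data source survives the shading and Spark can still find the format at runtime:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- merge META-INF/services files so the DataSourceRegister
               entry for the kafka format is kept in the uber-jar -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

After mvn package, the resulting jar can be submitted on its own, without the --packages option.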
