I am trying to use TwitterUtils in the Spark shell (it is not available there by default).
I added the following to spark-env.sh:

    SPARK_CLASSPATH="/disk.b/spark-master-2014-07-28/external/twitter/target/spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar"
I can now execute

    import org.apache.spark.streaming.twitter._
    import org.apache.spark.streaming.StreamingContext._
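in the shell without errors, which would not be possible if the JAR had not been added to the classpath ("error: object twitter is not a member of package org.apache.spark.streaming"). However, I get an error when I run the following in the Spark shell: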
    scala> val ssc = new StreamingContext(sc, Seconds(1))
    ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@6e78177b

    scala> val tweets = TwitterUtils.createStream(ssc, "twitter.txt")
    error: bad symbolic reference. A signature in TwitterUtils.class refers to term twitter4j
    in package <root> which is not available.
    It may be completely missing from the current classpath, or the version on
    the classpath might be incompatible with the version used when compiling TwitterUtils.class.
What am I missing? Do I have to add another JAR?
Yes, you need the Twitter4J JARs in addition to the spark-streaming-twitter JAR you already have. Specifically, the Spark developers recommend Twitter4J version 3.0.3.
Once you have downloaded the correct JARs, you need to pass them to the shell via the --jars flag. I think you can also do this through SPARK_CLASSPATH, as you did.
Here is how I did it on a Spark EC2 cluster:
    #!/bin/bash
    cd /root/spark/lib
    mkdir twitter4j

    # Get the Spark Streaming JAR.
    curl -O "http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-streaming-twitter_2.10/1.0.0/spark-streaming-twitter_2.10-1.0.0.jar"

    # Get the Twitter4J JARs. Check out http://twitter4j.org/archive/ for other versions.
    TWITTER4J_SOURCE=twitter4j-3.0.3.zip
    curl -O "http://twitter4j.org/archive/$TWITTER4J_SOURCE"
    unzip -j ./$TWITTER4J_SOURCE "lib/*.jar" -d twitter4j/
    rm $TWITTER4J_SOURCE

    cd

    # Point the shell to these JARs and go!
    # ls -m separates names with ", ", so strip spaces as well as newlines.
    TWITTER4J_JARS=`ls -m /root/spark/lib/twitter4j/*.jar | tr -d ' \n'`
    /root/spark/bin/spark-shell --jars /root/spark/lib/spark-streaming-twitter_2.10-1.0.0.jar,$TWITTER4J_JARS
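
The --jars flag expects one comma-separated list, and that list can also be built without parsing ls output. The following is a minimal sketch, assuming bash; join_jars is a hypothetical helper name used for illustration and is not part of Spark.

```shell
#!/bin/bash
# join_jars DIR [SEP]: print every *.jar in DIR joined by SEP (default ",").
# Hypothetical helper for illustration; it is not part of Spark.
join_jars() {
  local dir="$1" sep="${2:-,}" joined="" jar
  for jar in "$dir"/*.jar; do
    # Skip the unexpanded pattern when DIR holds no JARs.
    [ -e "$jar" ] || continue
    # Prepend the separator only once the list is non-empty.
    joined="${joined:+$joined$sep}$jar"
  done
  printf '%s\n' "$joined"
}
```

With this in place, the last step could become something like: /root/spark/bin/spark-shell --jars "/root/spark/lib/spark-streaming-twitter_2.10-1.0.0.jar,$(join_jars /root/spark/lib/twitter4j)". Passing ":" as the second argument produces a colon-separated list instead, which is the form a classpath variable such as SPARK_CLASSPATH uses on Unix.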