Machine Learning with Spark on Google Cloud Dataproc Reviews
10347 reviews
Shiv P. · Reviewed about 7 years ago
鈺介 陸. · Reviewed about 7 years ago
Jupyter notebook cannot be launched with the firewall rules in my environment. Could not do that portion of the lab.
Kevin R. · Reviewed about 7 years ago
Sergey S. · Reviewed about 7 years ago
edward c. · Reviewed about 7 years ago
Andrea C. · Reviewed about 7 years ago
I'll be contacting Qwiklabs for a refund on this lab. I had problems, like many others, before I was able to get any points. This lab has proved to me that I must read the comments first. The spark.read statement is out of date for the version of PySpark they are using in the lab. I tried to research on StackOverflow and other places but could not make that statement work properly, and it is what is supposed to set up the first dataset:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'property' object has no attribute 'option'

If I could award 0 stars, I would.
Lawrence M. · Reviewed about 7 years ago
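The AttributeError quoted in the review above is consistent with `spark` referring to the SparkSession class (or being rebound) rather than the live session object: in Python, accessing a property on a class returns the property descriptor itself, which has no `.option`. A minimal sketch in plain Python, using a hypothetical `FakeSparkSession` class to stand in for pyspark's SparkSession (no Spark required):

```python
# FakeSparkSession is illustrative only; pyspark's real SparkSession
# likewise exposes `read` as a property on instances.
class FakeSparkSession:
    @property
    def read(self):
        return "a DataFrameReader"

session = FakeSparkSession()
print(session.read)                  # instance access: the property resolves
print(type(FakeSparkSession.read))   # class access: the property descriptor itself

# Calling .option on the descriptor reproduces the review's error:
try:
    FakeSparkSession.read.option("header", "true")
except AttributeError as e:
    print(e)  # 'property' object has no attribute 'option'
```

If this is the cause, restarting `pyspark` (which rebinds `spark` to a fresh session) is usually enough to clear it.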
WidenResearch w. · Reviewed about 7 years ago
I completed all the steps, but the last 5 points were not registered.
Yap G. · Reviewed about 7 years ago
Devin L. · Reviewed about 7 years ago
Very nice tutorial on PySpark; easy to understand and follow.
Nhan D. · Reviewed about 7 years ago
I had all the correct output and changed to the correct shard (00004), but the lab would not give me credit. Very frustrating!
Eden E. · Reviewed about 7 years ago
Most of the code in the PySpark section throws errors.
Girish G. · Reviewed about 7 years ago
Was NOT successful. Got errors in Spark and could not find the reason. I tried repeating all the steps beforehand to avoid such errors (assuming that having forgotten a step was a potential cause):

Connected, host fingerprint: ssh-rsa 0 07:38:4A:8F:17:EA:A2:6C:AA:36:0E:D1:F9:EB:9D:ED:5E:09:95:1F:48:EE:04:0F:20:5E:31:59:D4:B3:E1:E4
Linux ch6cluster-m 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64
The programs included with the Debian GNU/Linux system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.
Creating directory '/home/google2832717_student'.
google2832717_student@ch6cluster-m:~$ export PROJECT_ID=$(gcloud info --format='value(config.project)')
google2832717_student@ch6cluster-m:~$ export BUCKET=${PROJECT_ID}
google2832717_student@ch6cluster-m:~$ export ZONE=us-central1-a
google2832717_student@ch6cluster-m:~$ pyspark
Python 2.7.13 (default, Sep 26 2018, 18:42:22) [GCC 6.3.0 20170516] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/03/23 14:29:46 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
Welcome to Spark version 2.3.2
Using Python version 2.7.13 (default, Sep 26 2018 18:42:22)
SparkSession available as 'spark'.
>>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
>>> from pyspark.mllib.regression import LabeledPoint
>>> "BUCKET=os.environ['BUCKET']
  File "<stdin>", line 1
SyntaxError: EOL while scanning string literal
>>> traindays = spark.read \
...     .option(""header"", ""true"") \
  File "<stdin>", line 2
SyntaxError: invalid syntax
>>> .csv('gs://{}/flights/trainday.csv'.format(BUCKET))"
  File "<stdin>", line 1
IndentationError: unexpected indent
>>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
>>> from pyspark.mllib.regression import LabeledPoint
>>> BUCKET=os.environ['BUCKET']
>>> traindays = spark.read \
...     .option("header", "true") \
...     .csv('gs://{}/flights/trainday.csv'.format(BUCKET))
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR, /etc/hive/conf.dist/ivysettings.xml will be used
>>> traindays.createOrReplaceTempView('traindays')
>>> spark.sql("SELECT * from traindays ORDER BY FL_DATE LIMIT 5").show()
+----------+------------+
|   FL_DATE|is_train_day|
+----------+------------+
|2015-01-01|        True|
>>> flights = spark.read \
...     .schema(schema) \
...     .csv(inputs)
>>> flights.createOrReplaceTempView('flights')
19/03/23 14:37:25 WARN org.apache.spark.util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
>>> trainquery = """
... SELECT
...   F.DEP_DELAY, F.TAXI_OUT, f.ARR_DELAY, F.DISTANCE
... FROM flights f
... JOIN traindays t
... ON f.FL_DATE == t.FL_DATE
... WHERE
...   t.is_train_day == 'True'
... """
>>> traindata = spark.sql(trainquery)
>>> trainquery = """
... SELECT
...   F.DEP_DELAY, F.TAXI_OUT, f.ARR_DELAY, F.DISTANCE
... FROM flights f
... JOIN traindays t
... ON f.FL_DATE == t.FL_DATE
... WHERE
...   t.is_train_day == 'True'
... """
>>> traindata = spark.sql(trainquery)
>>> traindata.head(2)
[Row(DEP_DELAY=-2.0, TAXI_OUT=26.0, ARR_DELAY=0.0, DISTANCE=677.0), Row(DEP_DELAY=-2.0, TAXI_OUT=22.0, ARR_DELAY=3.0, DISTANCE=451.0)]
>>> traindata.describe().show()
+-------+------------------+-----------------+-----------------+-----------------+
|summary|         DEP_DELAY|         TAXI_OUT|        ARR_DELAY|         DISTANCE|
+-------+------------------+-----------------+-----------------+-----------------+
|  count|            151446|           151373|           150945|           152566|
|   mean|10.726252261532164|16.11821791204508|5.310126204909073|837.4265432665208|
| stddev| 36.38718688562445|8.897148233750972|38.04559816976176|623.0449480656523|
|    min|             -39.0|              1.0|            -68.0|             31.0|
|    max|            1393.0|            168.0|           1364.0|           4983.0|
+-------+------------------+-----------------+-----------------+-----------------+
>>> def to_example(raw_data_point):
...     return LabeledPoint(\
...         float(raw_data_point['ARR_DELAY'] < 15),  # on-time? \
...         [\
...             raw_data_point['DEP_DELAY'], \
...             raw_data_point['TAXI_OUT'], \
...             raw_data_point['DISTANCE'], \
...         ])
...
>>> traindays.createOrReplaceTempView('traindays')
>>> spark.sql("SELECT * from traindays ORDER BY FL_DATE LIMIT 5").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unbound method sql() must be called with SparkSession instance as first argument (got str instance instead)
>>> from pyspark.sql.types \
...     import StringType, FloatType, StructType, StructField
>>> header = 'FL_DATE,UNIQUE_CARRIER,AIRLINE_ID,CARRIER,FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,CANCELLED,CANCELLATION_CODE,DIVERTED,DISTANCE,DEP_AIRPORT_LAT,DEP_AIRPORT_LON,DEP_AIRPORT_TZOFFSET,ARR_AIRPORT_LAT,ARR_AIRPORT_LON,ARR_AIRPORT_TZOFFSET,EVENT,NOTIFY_TIME'
>>> def get_structfield(colname):
...     if colname in ['ARR_DELAY', 'DEP_DELAY', 'DISTANCE', 'TAXI_OUT']:
...         return StructField(colname, FloatType(), True)
...     else:
...         return StructField(colname, StringType(), True)
...
>>> schema = StructType([get_structfield(colname) for colname in header.split(',')])
>>> inputs = 'gs://{}/flights/tzcorr/all_flights-00004-*'.format(BUCKET)
>>> flights = spark.read \
...     .schema(schema) \
...     .csv(inputs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'property' object has no attribute 'schema'
>>> flights.createOrReplaceTempView('flights')
>>> trainquery = """
... SELECT
...   F.DEP_DELAY, F.TAXI_OUT, f.ARR_DELAY, F.DISTANCE
... FROM flights f
... JOIN traindays t
... ON f.FL_DATE == t.FL_DATE
... WHERE
...   t.is_train_day == 'True'
... """
>>> traindata = spark.sql(trainquery)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unbound method sql() must be called with SparkSession instance as first argument (got str instance instead)
>>> traindata.head(2)
[Row(DEP_DELAY=-2.0, TAXI_OUT=26.0, ARR_DELAY=0.0, DISTANCE=677.0), Row(DEP_DELAY=-2.0, TAXI_OUT=22.0, ARR_DELAY=3.0, DISTANCE=451.0)]
>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unbound method sql() must be called with SparkSession instance as first argument (got str instance instead)
>>> traindata.describe().show()
+-------+------------------+-----------------+-----------------+-----------------+
|summary|         DEP_DELAY|         TAXI_OUT|        ARR_DELAY|         DISTANCE|
+-------+------------------+-----------------+-----------------+-----------------+
|  count|            151446|           151373|           150945|           152566|
|   mean|10.726252261532164|16.11821791204508|5.310126204909073|837.4265432665208|
| stddev| 36.38718688562445|8.897148233750972|38.04559816976176|623.0449480656523|
|    min|             -39.0|              1.0|            -68.0|             31.0|
|    max|            1393.0|            168.0|           1364.0|           4983.0|
+-------+------------------+-----------------+-----------------+-----------------+
>>> def to_example(raw_data_point):
...     return LabeledPoint(\
...         float(raw_data_point['ARR_DELAY'] < 15),  # on-time? \
...         [\
...             raw_data_point['DEP_DELAY'], \
...             raw_data_point['TAXI_OUT'], \
...             raw_data_point['DISTANCE'], \
...         ])
...
>>> examples = traindata.rdd.map(to_example)
>>> lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)
[Stage 22:> (0 + 2) / 2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 398, in train
    return _regression_train_wrapper(train, LogisticRegressionModel, data, initialWeights)
  File "/usr/lib/spark/python/pyspark/mllib/regression.py", line 215, in _regression_train_wrapper
    data, _convert_to_vector(initial_weights))
  File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 388, in train
    float(tolerance), bool(validateData), int(numClasses))
  File "/usr/lib/spark/python/pyspark/mllib/common.py", line 130, in callMLlibFunc
    return callJavaFunc(sc, api, *args)
  File "/usr/lib/spark/python/pyspark/mllib/common.py", line 123, in callJavaFunc
    return _java2py(sc, func(*args))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: init value should <= bound'
>>> print lrmodel.weights,lrmodel.intercept
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'lrmodel' is not defined
>>> [-0.17315525007,-0.123703577812,0.00047521823417] 5.26368986835
  File "<stdin>", line 1
SyntaxError: invalid syntax
>>> lrmodel.predict([6.0,12.0,594.0])
NameError: name 'lrmodel' is not defined
>>> lrmodel.predict([36.0,12.0,594.0])
NameError: name 'lrmodel' is not defined
>>> lrmodel.clearThreshold()
NameError: name 'lrmodel' is not defined
>>> print lrmodel.predict([6.0,12.0,594.0])
NameError: name 'lrmodel' is not defined
>>> print lrmodel.predict([36.0,12.0,594.0])
NameError: name 'lrmodel' is not defined
>>> lrmodel.setThreshold(0.7)
NameError: name 'lrmodel' is not defined
>>> print lrmodel.predict([6.0,12.0,594.0])
NameError: name 'lrmodel' is not defined
>>> print lrmodel.predict([36.0,12.0,594.0])
NameError: name 'lrmodel' is not defined
>>> MODEL_FILE='gs://' + BUCKET + '/flights/sparkmloutput/model'
>>> os.system('gsutil -m rm -r ' + MODEL_FILE)
CommandException: 1 files/objects could not be removed.
256
>>> lrmodel.save(sc, MODEL_FILE)
NameError: name 'lrmodel' is not defined
>>> print '{} saved'.format(MODEL_FILE)
gs://qwiklabs-gcp-a17a185bd4f73119/flights/sparkmloutput/model saved
>>> lrmodel = 0
>>> print lrmodel
0
>>> from pyspark.mllib.classification import LogisticRegressionModel
>>> lrmodel = LogisticRegressionModel.load(sc, MODEL_FILE)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
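One plausible trigger for the SyntaxErrors early in the transcript above is quotes getting doubled during copy/paste (`.option(""header"", ""true"")`). A small sketch in plain Python, using `compile()` on two illustrative strings, shows that the doubled-quote version does not even parse, while the form the lab intends is valid syntax:

```python
# Both strings are illustrative; `path` and `spark` need not exist,
# since compile() only checks syntax, not name resolution.
bad  = 'traindays = spark.read.option(""header"", ""true"").csv(path)'
good = 'traindays = spark.read.option("header", "true").csv(path)'

def parses(src):
    """Return True if the source string is syntactically valid Python."""
    try:
        compile(src, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

print(parses(bad))   # False: ""header"" is an empty string glued to a name
print(parses(good))  # True
```

Pasting lab code into a plain-text editor first, or retyping the quotes by hand, avoids this class of failure; the later NameErrors then cascade from the training step that never succeeded.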
Stephan H. · Reviewed about 7 years ago
Adam C. · Reviewed about 7 years ago
Anurag M. · Reviewed about 7 years ago
The notebook is not loading any kernels and thus the steps regarding the notebook cannot be performed.
Ilias S. · Reviewed about 7 years ago
Even though I did everything as stated in the lab, I was not able to get the full score. The last check, for replacing the input with 00004, did not return any score.
hari s. · Reviewed about 7 years ago
I finished the lab, but it didn't give me the last credit for updating the notebook to reference all_flights_00004.
Michael V. · Reviewed about 7 years ago
joe k. · Reviewed about 7 years ago
Sakthi Pravin N. · Reviewed about 7 years ago
We do not ensure the published reviews originate from consumers who have purchased or used the products. Reviews are not verified by Google.