Machine Learning with Spark on Google Cloud Dataproc Reviews

10347 reviews

Shiv P. · Reviewed about 7 years ago

鈺介 陸. · Reviewed about 7 years ago

The Jupyter notebook cannot be launched with the firewall rules in my environment, so I could not do that portion of the lab.

Kevin R. · Reviewed about 7 years ago

Sergey S. · Reviewed about 7 years ago

edward c. · Reviewed about 7 years ago

Andrea C. · Reviewed about 7 years ago

I'll be contacting Qwiklabs for a refund on this lab. Like many others, I had problems before I was able to get any points. This lab has proved to me that I must read the comments first. The spark.read statement is out of date for the version of PySpark used in the lab. I researched on StackOverflow and elsewhere but could not make that statement work properly, and it is what is supposed to set up the first dataset:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'property' object has no attribute 'option'

If I could award 0 stars, I would.

Lawrence M. · Reviewed about 7 years ago
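The "AttributeError: 'property' object has no attribute 'option'" quoted in the review above is the generic Python error raised when a property descriptor is looked up on a class rather than on an instance; on a live session object, spark.read returns a reader that .option(...) can be chained on. A minimal stdlib-only sketch of the mechanism, using toy Session/Reader classes as hypothetical stand-ins for SparkSession/DataFrameReader (not Spark's actual implementation):

```python
class Reader(object):
    """Toy stand-in for a chainable DataFrameReader."""
    def __init__(self):
        self.opts = {}

    def option(self, key, value):
        self.opts[key] = value
        return self  # chainable, like the real reader


class Session(object):
    """Toy stand-in for a SparkSession; 'read' is a property."""
    @property
    def read(self):
        return Reader()


spark = Session()

# Correct: access the property on an *instance* -> a Reader you can chain.
reader = spark.read.option("header", "true")
print(type(reader).__name__)  # Reader

# Bug: access the property on the *class* -> the property object itself,
# which has no 'option' attribute.
try:
    Session.read.option("header", "true")
except AttributeError as err:
    print(err)  # 'property' object has no attribute 'option'
```

So the error suggests the name used before .read referred to a class (or a rebound name), not to the running session the shell provides as 'spark'.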

WidenResearch w. · Reviewed about 7 years ago

I completed all the steps, but the last 5 points were not registered.

Yap G. · Reviewed about 7 years ago

Devin L. · Reviewed about 7 years ago

Very nice tutorial on PySpark; easy to understand and follow.

Nhan D. · Reviewed about 7 years ago

I had all the correct output and changed to the correct shard 00004, but the lab would not give me credit. Very frustrating!

Eden E. · Reviewed about 7 years ago

Most of the code in the PySpark section throws errors.

Girish G. · Reviewed about 7 years ago

Was NOT successful. I got errors in Spark and could not find the reason. I tried repeating all the steps beforehand to avoid such errors (assuming I had forgotten a step). Session transcript:

    Connected, host fingerprint: ssh-rsa 0 07:38:4A:8F:17:EA:A2:6C:AA:36:0E:D1:F9:EB:9D:ED:5E:09:95:1F:48:EE:04:0F:20:5E:31:59:D4:B3:E1:E4
    Linux ch6cluster-m 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64
    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    Creating directory '/home/google2832717_student'.

    google2832717_student@ch6cluster-m:~$ export PROJECT_ID=$(gcloud info --format='value(config.project)')
    google2832717_student@ch6cluster-m:~$ export BUCKET=${PROJECT_ID}
    google2832717_student@ch6cluster-m:~$ export ZONE=us-central1-a
    google2832717_student@ch6cluster-m:~$ pyspark
    Python 2.7.13 (default, Sep 26 2018, 18:42:22) [GCC 6.3.0 20170516] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    19/03/23 14:29:46 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler
      configuration file not found so jobs will be scheduled in FIFO order. To use fair
      scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file
      to a file that contains the configuration.
    Welcome to Spark version 2.3.2
    Using Python version 2.7.13 (default, Sep 26 2018 18:42:22)
    SparkSession available as 'spark'.

    >>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    >>> from pyspark.mllib.regression import LabeledPoint
    >>> "BUCKET=os.environ['BUCKET']
      File "<stdin>", line 1
        "BUCKET=os.environ['BUCKET']
                                   ^
    SyntaxError: EOL while scanning string literal
    >>> traindays = spark.read \
    ...     .option(""header"", ""true"") \
      File "<stdin>", line 2
        .option(""header"", ""true"") \
                ^
    SyntaxError: invalid syntax
    >>>     .csv('gs://{}/flights/trainday.csv'.format(BUCKET))"
      File "<stdin>", line 1
        .csv('gs://{}/flights/trainday.csv'.format(BUCKET))"
        ^
    IndentationError: unexpected indent
    >>> BUCKET=os.environ['BUCKET']
    >>> traindays = spark.read \
    ...     .option("header", "true") \
    ...     .csv('gs://{}/flights/trainday.csv'.format(BUCKET))
    ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,
      /etc/hive/conf.dist/ivysettings.xml will be used
    >>> traindays.createOrReplaceTempView('traindays')
    >>> spark.sql("SELECT * from traindays ORDER BY FL_DATE LIMIT 5").show()
    +----------+------------+
    |   FL_DATE|is_train_day|
    +----------+------------+
    |2015-01-01|        True|
    >>> flights = spark.read \
    ...     .schema(schema) \
    ...     .csv(inputs)
    >>> flights.createOrReplaceTempView('flights')
    19/03/23 14:37:25 WARN org.apache.spark.util.Utils: Truncated the string representation
      of a plan since it was too large. This behavior can be adjusted by setting
      'spark.debug.maxToStringFields' in SparkEnv.conf.
    >>> trainquery = """
    ... SELECT
    ...   F.DEP_DELAY,F.TAXI_OUT,f.ARR_DELAY,F.DISTANCE
    ... FROM flights f
    ... JOIN traindays t
    ... ON f.FL_DATE == t.FL_DATE
    ... WHERE
    ...   t.is_train_day == 'True'
    ... """
    >>> traindata = spark.sql(trainquery)
    >>> traindata.head(2)
    [Row(DEP_DELAY=-2.0, TAXI_OUT=26.0, ARR_DELAY=0.0, DISTANCE=677.0),
     Row(DEP_DELAY=-2.0, TAXI_OUT=22.0, ARR_DELAY=3.0, DISTANCE=451.0)]
    >>> traindata.describe().show()
    +-------+------------------+-----------------+-----------------+-----------------+
    |summary|         DEP_DELAY|         TAXI_OUT|        ARR_DELAY|         DISTANCE|
    +-------+------------------+-----------------+-----------------+-----------------+
    |  count|            151446|           151373|           150945|           152566|
    |   mean|10.726252261532164|16.11821791204508|5.310126204909073|837.4265432665208|
    | stddev| 36.38718688562445|8.897148233750972|38.04559816976176|623.0449480656523|
    |    min|             -39.0|              1.0|            -68.0|             31.0|
    |    max|            1393.0|            168.0|           1364.0|           4983.0|
    +-------+------------------+-----------------+-----------------+-----------------+
    >>> def to_example(raw_data_point):
    ...     return LabeledPoint(\
    ...         float(raw_data_point['ARR_DELAY'] < 15),  # on-time? \
    ...         [ \
    ...             raw_data_point['DEP_DELAY'], \
    ...             raw_data_point['TAXI_OUT'], \
    ...             raw_data_point['DISTANCE'], \
    ...         ])
    ...
    >>> spark.sql("SELECT * from traindays ORDER BY FL_DATE LIMIT 5").show()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unbound method sql() must be called with SparkSession instance
      as first argument (got str instance instead)
    >>> from pyspark.sql.types \
    ...     import StringType, FloatType, StructType, StructField
    >>> header = 'FL_DATE,UNIQUE_CARRIER,AIRLINE_ID,CARRIER,FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,CANCELLED,CANCELLATION_CODE,DIVERTED,DISTANCE,DEP_AIRPORT_LAT,DEP_AIRPORT_LON,DEP_AIRPORT_TZOFFSET,ARR_AIRPORT_LAT,ARR_AIRPORT_LON,ARR_AIRPORT_TZOFFSET,EVENT,NOTIFY_TIME'
    >>> def get_structfield(colname):
    ...     if colname in ['ARR_DELAY', 'DEP_DELAY', 'DISTANCE', 'TAXI_OUT']:
    ...         return StructField(colname, FloatType(), True)
    ...     else:
    ...         return StructField(colname, StringType(), True)
    ...
    >>> schema = StructType([get_structfield(colname) for colname in header.split(',')])
    >>> inputs = 'gs://{}/flights/tzcorr/all_flights-00004-*'.format(BUCKET)
    >>> flights = spark.read \
    ...     .schema(schema) \
    ...     .csv(inputs)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'property' object has no attribute 'schema'
    >>> flights.createOrReplaceTempView('flights')
    >>> traindata = spark.sql(trainquery)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unbound method sql() must be called with SparkSession instance
      as first argument (got str instance instead)
    >>> examples = traindata.rdd.map(to_example)
    >>> lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)
    [Stage 22:>          (0 + 2) / 2]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 398, in train
        return _regression_train_wrapper(train, LogisticRegressionModel, data, initialWeights)
      File "/usr/lib/spark/python/pyspark/mllib/regression.py", line 215, in _regression_train_wrapper
        data, _convert_to_vector(initial_weights))
      File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 388, in train
        float(tolerance), bool(validateData), int(numClasses))
      File "/usr/lib/spark/python/pyspark/mllib/common.py", line 130, in callMLlibFunc
        return callJavaFunc(sc, api, *args)
      File "/usr/lib/spark/python/pyspark/mllib/common.py", line 123, in callJavaFunc
        return _java2py(sc, func(*args))
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
      File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
        raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
    pyspark.sql.utils.IllegalArgumentException: u'requirement failed: init value should <= bound'
    >>> print lrmodel.weights,lrmodel.intercept
    NameError: name 'lrmodel' is not defined
    >>> [-0.17315525007,-0.123703577812,0.00047521823417] 5.26368986835
    SyntaxError: invalid syntax
    >>> lrmodel.predict([6.0,12.0,594.0])
    NameError: name 'lrmodel' is not defined
    >>> lrmodel.predict([36.0,12.0,594.0])
    NameError: name 'lrmodel' is not defined
    >>> lrmodel.clearThreshold()
    NameError: name 'lrmodel' is not defined
    >>> print lrmodel.predict([6.0,12.0,594.0])
    NameError: name 'lrmodel' is not defined
    >>> print lrmodel.predict([36.0,12.0,594.0])
    NameError: name 'lrmodel' is not defined
    >>> lrmodel.setThreshold(0.7)
    NameError: name 'lrmodel' is not defined
    >>> MODEL_FILE='gs://' + BUCKET + '/flights/sparkmloutput/model'
    >>> os.system('gsutil -m rm -r ' + MODEL_FILE)
    CommandException: 1 files/objects could not be removed.
    256
    >>> lrmodel.save(sc, MODEL_FILE)
    NameError: name 'lrmodel' is not defined
    >>> print '{} saved'.format(MODEL_FILE)
    gs://qwiklabs-gcp-a17a185bd4f73119/flights/sparkmloutput/model saved
    >>> lrmodel = 0
    >>> print lrmodel
    0
    >>> from pyspark.mllib.classification import LogisticRegressionModel
    >>> lrmodel = LogisticRegressionModel.load(sc, MODEL_FILE)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
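Every NameError near the end of the transcript above follows from one failed call: lrmodel = LogisticRegressionWithLBFGS.train(...) raised before the assignment completed, so the name lrmodel was never bound, and each later lrmodel.* line fails identically. A minimal sketch of that failure mode, where train_or_fail is a hypothetical stand-in for the failing Spark call:

```python
def train_or_fail():
    # Hypothetical stand-in for the Spark train() call that raised
    # IllegalArgumentException in the transcript above.
    raise ValueError("requirement failed: init value should <= bound")


try:
    lrmodel = train_or_fail()
except ValueError as err:
    print("training failed:", err)

# The assignment never completed, so 'lrmodel' is unbound and every
# subsequent use raises, just like the cascade in the transcript.
try:
    lrmodel.predict([6.0, 12.0, 594.0])
except NameError as err:
    print(err)
```

Fixing whatever made train() raise (and re-running it successfully) is what clears the cascade; the later predict/save calls themselves are not at fault.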

Stephan H. · Reviewed about 7 years ago

Adam C. · Reviewed about 7 years ago

Anurag M. · Reviewed about 7 years ago

The notebook is not loading any kernels, so the steps involving the notebook cannot be performed.

Ilias S. · Reviewed about 7 years ago

Even though I did everything as stated in the lab, I was not able to get the full score. The last part, checking the replacement to 00004 in the input, did not return any score.

hari s. · Reviewed about 7 years ago

I finished the lab, but it didn't give me the last credit for updating the notebook to reference all_flights_00004.

Michael V. · Reviewed about 7 years ago

joe k. · Reviewed about 7 years ago

Sakthi Pravin N. · Reviewed about 7 years ago

We do not ensure the published reviews originate from consumers who have purchased or used the products. Reviews are not verified by Google.