Tag: apache-spark

31 What are the use cases for Apache Spark vs Hadoop 2014-06-17T20:48:35.267

27 Merging multiple data frames row-wise in PySpark 2016-04-22T04:27:45.507

13 How to calculate the mean of a dataframe column and find the top 10% 2015-07-22T14:16:22.823

13 Replace all numeric values in a pyspark dataframe by a constant value 2016-10-19T23:22:22.527

12 How to convert categorical data to numerical data in Pyspark 2015-06-29T22:55:28.100

12 Issue with IPython/Jupyter on Spark (Unrecognized alias) 2015-07-23T03:45:36.867

11 Spark ALS: recommending for new users 2016-10-24T21:13:33.707

10 Spark, optimally splitting a single RDD into two 2015-05-01T20:32:51.900

10 Server log analysis using machine learning 2015-11-27T18:11:03.323

10 When does cache get expired for a RDD in pyspark? 2016-05-10T12:38:18.240

10 Calculate cosine similarity in Apache Spark 2016-08-10T05:43:41.613

10 PySpark dataframe repartition 2018-02-22T10:19:01.260

9 How to run a pyspark application in windows 8 command prompt 2015-06-21T17:31:05.457

8 How do I set/get heap size for Spark (via Python notebook) 2015-10-21T18:17:22.190

8 How to select particular column in Spark(pyspark)? 2016-01-03T02:10:10.643

8 Unbalanced class: class_weight for ML algorithms in Spark MLLib 2016-12-07T00:08:48.120

8 Understanding how distributed PCA works 2017-04-19T08:58:18.707

7 Using Apache Spark to do ML. Keep getting serializing errors 2014-07-25T21:03:44.663

7 Why does logistic regression in Spark and R return different models for the same data? 2015-05-07T13:23:47.440

7 SPARK Mllib: Multiclass logistic regression, how to get the probabilities of all classes rather than the top one? 2015-12-17T10:52:10.013

7 Extracting individual emails from an email thread 2017-06-01T13:02:23.683

7 Why does spark.ml.feature.Word2Vec vectorize sentences instead of single words? 2018-07-27T15:53:51.907

6 Item-Item similarity based on text 2015-07-28T16:15:43.783

6 SPARK, ML: Naive Bayes classifier often assigns 1 as probability prediction 2015-12-16T14:55:27.443

6 Reading CSVs with new lines in fields with Spark 2016-07-11T21:02:40.633

6 What are the alternatives to Python + Spark (pyspark)? 2018-04-23T13:29:03.863

5 Local Development for Apache Spark 2015-02-15T04:51:21.167

5 Can theano work on mapreduce or on spark? 2015-07-09T21:29:17.050

5 Random Forest Regression. How to represent really long list of categories for processing 2015-12-14T16:58:41.163

5 Distributed k-means in Spark 2016-02-10T22:53:49.620

5 Machine Learning in Spark 2016-06-21T09:40:45.333

5 Using Spark for finding similar users to a user? 2017-07-04T12:35:41.023

5 Saving Large Spark ML Pipeline to HDFS 2018-01-08T16:19:33.187

5 How to implement LSTM with Spark? 2019-06-02T14:53:58.690

5 Pyspark: Filter dataframe based on separate specific conditions 2019-06-09T06:22:53.393

4 Choosing between Storm+Trident-ML, Storm+SAMOA or Spark Streaming+MLlib 2015-03-30T04:35:58.667

4 How does MLlib in Spark select variables in logistic regression 2015-05-04T13:26:04.767

4 Performance profiling and tuning in Apache Spark 2015-05-07T20:08:05.440

4 Scan-based operations Apache Spark 2015-10-12T15:23:01.260

4 How to start prediction from dataset? 2016-06-09T00:02:39.277

4 Spark MLLib - how to re-use TF-IDF model 2016-11-01T19:07:37.377

4 ALS in Spark: what loss function is it minimizing? 2017-07-03T15:14:24.617

4 Is there any point in learning Hadoop in 2018? 2018-12-23T15:19:13.280

3 Which Spark MLlib regression algorithm is suitable for numeric predictions based on non-numeric features? 2015-11-27T02:54:52.780

3 How to determine Nonnegativity in Matrix Factorization? 2015-12-10T20:22:54.097

3 Sampling with replacement, specify the probabilities 2015-12-18T16:19:00.960

3 ARIMAX with spark-timeseries 2016-01-20T19:03:33.203

3 Why is Spark's LinearRegressionWithSGD very slow locally? 2016-02-28T17:25:28.147

3 Solution for Time/Space Complexity challenge in Recommendation System? 2016-08-08T05:45:31.677

3 Task not serializable Error 2016-09-14T12:56:58.240

3 Hashing trick with random forest in scala 2016-09-22T08:34:17.947

3 RDD of gzipped files to "uncompressed" Dataframe 2016-11-10T23:50:52.223

3 Deploying models on bigdata platforms like Hadoop and Spark 2017-03-09T12:32:53.170

3 Clustering a very large number of very small clusters with most data unrelated 2017-06-12T16:40:59.607

3 How to setup a home-laptop cluster to 'practice' elasticsearch, hadoop, mesos and spark 2017-07-06T18:02:16.953

3 Is there any way to read an Xlsx file in pyspark? Also want to read strings of column from each columnName 2017-08-31T09:09:53.743

3 What are the tools to speed up the running time of machine learning algorithms? 2018-02-28T16:55:43.977

3 convert list of tuple of tuple to list of tuple in pySpark 2018-06-10T12:42:59.007

3 Deep Learning in Spark Clusters vs on GPUs? 2018-09-04T06:24:20.607

3 Navigating the jungle of choices for scalable ML deployment 2018-09-07T07:22:17.930

3 Spark DataFrame "Limit" function takes too much time to display result 2019-02-11T09:57:39.567

3 BERT in production 2020-02-27T17:16:12.383

3 User defined aggregations on data of around 200GB where row order matters 2020-06-25T15:02:30.367

3 What is the main difference between Hadoop and Spark? 2020-09-05T11:28:44.113

2 Scalable open source machine learning library written in python 2015-07-09T20:38:22.933

2 What makes a graph algorithm a good candidate for concurrency? 2015-07-28T22:14:31.177

2 How to decide the number of trees parameter for Random Forest algorithm in PySpark MLlib? 2016-01-21T22:51:03.573

2 How to predict an approximate weekly/monthly number, when the Unique Daily Visitors for that week/month are already known 2016-01-25T11:03:17.520

2 Use spark_csv inside Jupyter and using Python 2016-01-25T13:57:24.367

2 Spark ALS-WR giving the same recommended items for all users 2016-02-10T14:50:49.610

2 Which is the most appropriate algorithm to use with MLlib for predicting prices 2016-02-16T08:50:44.807

2 Algorithm Suggestion For a Specific Problem 2016-04-12T12:56:51.450

2 How to read contents of a CSV file inside zip file using spark (python) 2016-05-05T23:43:27.647

2 Unable to load NLTK in spark using PySpark 2016-05-18T03:19:58.333

2 Spark Scala alternative Machine Learning Library? 2016-05-27T09:57:50.057

2 How to interpret upper-triangular matrix of cosine similarities 2016-06-20T14:02:37.407

2 value saveAsTextFile is not a member of org.apache.spark.sql.DataFrame 2016-09-02T11:05:02.417

2 Do categorical features always need to be encoded? 2016-09-13T13:15:01.727

2 ARIMA(X) Validation 2016-09-14T19:10:31.557

2 Apache Spark ML vs Flink ML 2016-10-13T08:26:45.253

2 spark item similarity recommendation 2016-11-01T09:20:22.320

2 Mahout Spark shell not working 2016-11-02T09:44:46.437

2 Spark 1.6.1 - Determining the number of clusters in a data set 2016-11-21T18:46:06.700

2 Order SparseVectors by the closest distance to given SparseVector 2017-03-03T12:45:02.607

2 Model params tuning 2017-03-07T16:08:36.270

2 How to improve naive Bayes multiclass classification accuracy? 2017-06-27T07:08:55.867

2 Why does my master node get heap memory full for inbuilt SVD API in Apache Spark during calculation of inverse of a square matrix? 2017-11-15T11:00:26.433

2 API to find out how many executors are running my Spark jobs? 2017-12-14T03:58:38.120

2 Use cases for graph algorithms and graph data structures in finance and banking 2018-01-09T19:52:15.317

2 Best practice for developing using Spark 2018-02-09T12:59:34.347

2 Run python script using spark-submit on windows 7 2018-04-15T09:27:32.033

2 Does storing a file in HDFS parallelize it for Spark? 2018-04-28T18:51:11.863

2 PySpark Filter shows only 1 row 2018-05-21T21:43:52.043

2 Alternative to Apache Spark? 2018-09-04T04:45:23.420

2 Plotting in PySpark? 2018-09-06T12:22:52.040

2 cosine similarity between items (purchase data) and normalisation 2018-11-19T10:20:15.130

2 XGBoost most important features appear in multiple trees multiple times 2018-12-19T14:39:43.833

2 What is the difference between PySpark's featuresCol, labelCol, predictionCol, and probabilityCol? 2019-02-01T20:39:46.247

2 Obtain learning curve of Gradient Boosted Tree model in PySpark 2019-03-11T13:59:41.103