Monday, 2 April 2018

dataframe - How to make good reproducible Apache Spark examples



I've spent a fair amount of time reading questions under the Spark-related tags, and very often posters don't provide enough information for anyone to truly understand their question. I usually comment asking them to post an MCVE (a minimal, complete, and verifiable example), but sometimes getting them to show some sample input/output data is like pulling teeth. For example: see the comments on this question.



Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.




So how does one go about creating a good, reproducible example?


Answer



Provide small sample data that can be easily recreated.



At the very least, posters should provide a couple of rows and columns from their dataframe, along with code that can be used to create it. By easy, I mean cut and paste. Make it as small as possible while still demonstrating your problem.






I have the following dataframe:






+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
|    3|  5|    A|2017-01-03|
|    4|  7|    B|2017-01-04|
+-----+---+-----+----------+



which can be created with this code:



df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
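The snippet above assumes that `sqlCtx` already exists. As a minimal sketch, the same sample data can be kept in plain Python constants and handed to whichever entry point is available. The names `ROWS`, `COLUMNS`, and `make_example_df` are hypothetical, chosen only for illustration.

```python
# Sample data kept as plain Python objects so readers can paste it anywhere.
ROWS = [
    (1, 1, "A", "2017-01-01"),
    (2, 3, "B", "2017-01-02"),
    (3, 5, "A", "2017-01-03"),
    (4, 7, "B", "2017-01-04"),
]
COLUMNS = ("index", "X", "label", "date")


def make_example_df(session):
    """Build the example dataframe from ROWS and COLUMNS.

    `session` is assumed to be an object with a createDataFrame method,
    such as a SQLContext or a SparkSession.
    """
    return session.createDataFrame(ROWS, COLUMNS)


# Usage (requires a working Spark installation):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[1]").getOrCreate()
#   df = make_example_df(spark)
#   df.show()
```

Keeping the rows separate from the Spark call also lets answerers inspect the data without starting a Spark session.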





Show the desired output.



Ask your specific question and show us your desired output.







How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of the 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?



Desired output:



+-----+---+-----+----------+------------+
|index|  X|label|      date|is_divisible|
+-----+---+-----+----------+------------+
|    1|  1|    A|2017-01-01|         yes|
|    2|  3|    B|2017-01-02|         yes|
|    3|  5|    A|2017-01-03|         yes|
|    4|  7|    B|2017-01-04|          no|
+-----+---+-----+----------+------------+





Explain how to get your output.




Explain, in great detail, how you get your desired output. It helps to show an example calculation.






For instance, in row 1, X = 1 and date = 2017-01-01. Adding 7 days to the date yields 2017-01-08. The day of the month is 8, and since 8 is divisible by 1, the answer is 'yes'.



Likewise, for the last row, X = 7 and date = 2017-01-04. Adding 7 days to the date yields 2017-01-11, so the day of the month is 11. Since 11 % 7 is not 0, the answer is 'no'.
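One way to make sure the desired output in a post is correct is to recompute it outside Spark. The sketch below repeats the calculation described above in plain Python. The function name `is_divisible` and the `ROWS` list are illustrative only, and this is a check of the expected values, not a Spark solution.

```python
from datetime import date, timedelta


def is_divisible(day_string, x, days_to_add=7):
    """Return 'yes' if the day of month of (date + days_to_add) is divisible by x."""
    shifted = date.fromisoformat(day_string) + timedelta(days=days_to_add)
    return "yes" if shifted.day % x == 0 else "no"


# Same rows as the sample dataframe: (index, X, label, date).
ROWS = [
    (1, 1, "A", "2017-01-01"),
    (2, 3, "B", "2017-01-02"),
    (3, 5, "A", "2017-01-03"),
    (4, 7, "B", "2017-01-04"),
]

results = [is_divisible(d, x) for _, x, _, d in ROWS]
print(results)  # ['yes', 'yes', 'yes', 'no']
```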







Share your existing code.



Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.



(*You can leave out the code to create the spark context, but you should include all imports.)






I know how to add a new column that is date plus 7 days but I'm having trouble getting the day of the month as an integer.




from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))





Include versions, imports, and use syntax highlighting
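As a rough sketch of how version information could be gathered for a post, the helper below reports the Python version and the installed version of each named package. The function name `describe_environment` is hypothetical, and the code assumes Python 3.8 or later for `importlib.metadata`.

```python
import platform
from importlib import metadata


def describe_environment(packages=("pyspark",)):
    """Collect the Python version and the versions of the given packages."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            info[name] = "not installed"
    return info


print(describe_environment())
```

With a live session, the Spark version itself is also available as `spark.version` on a SparkSession.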









For performance tuning posts, include the execution plan
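A dataframe's plan can be printed with its `explain` method. The sketch below captures that printout as a string so it can be pasted into a post. The helper name `plan_as_text` is hypothetical, and the capture assumes that `explain` writes through Python's standard output, which may not hold in every environment.

```python
import io
from contextlib import redirect_stdout


def plan_as_text(df, extended=True):
    """Return the text that df.explain() prints, for pasting into a question.

    `df` is assumed to be a Spark DataFrame. With extended=True the logical
    plans are included along with the physical plan.
    """
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain(extended)
    return buffer.getvalue()


# Usage (requires a Spark dataframe named df):
#   print(plan_as_text(df))
```

If the captured string comes back empty, calling `df.explain(True)` directly and copying the console output serves the same purpose.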








Parsing spark output files





  • MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
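The referenced code is not reproduced here. As an independent, minimal sketch of the idea, the function below turns the text printed by `show()` into a list of dictionaries using only the standard library. The names `parse_show_output` and `SAMPLE` are hypothetical. All values are kept as strings, and cells that themselves contain a `|` character are not handled.

```python
SAMPLE = """
+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
+-----+---+-----+----------+
"""


def parse_show_output(text):
    """Parse the table printed by DataFrame.show() into a list of dicts."""
    header = None
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        # Skip blank lines and the +-----+ border lines.
        if not line or line.startswith("+"):
            continue
        # Only lines fully enclosed in pipes are table rows.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


parsed = parse_show_output(SAMPLE)
print(parsed[0])
```

The resulting rows can then be converted to the needed types and passed to `createDataFrame`, so answerers can rebuild a dataframe from a pasted table.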






Other notes.




