The .withColumn(...) transformation applies a function to other columns and/or literals (wrapped with the .lit(...) function) and stores the result as a new column. In SQL, this corresponds to any expression in the SELECT list that transforms one or more columns and uses AS to assign a new column name. The transformation returns a new DataFrame that extends the original with the added column; if a column with that name already exists, it is replaced.
Look at the following code snippet:
# split the HDD into size and type
(
    sample_data_schema
    .withColumn('HDDSplit', f.split(f.col('HDD'), ' '))
    .show()
)
It produces the following output:
You could achieve the same result with the .select(...) transformation. The following code produces the same data; only the name of the new column differs (HDD_Array instead of HDDSplit):
# do the same as withColumn
(
    sample_data_schema
    .select(
        f.col('*')
        , f.split(f.col('HDD'), ' ').alias('HDD_Array')
    ).show()
)
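The SQL equivalent (Spark SQL here, since T-SQL's STRING_SPLIT is a table-valued function and cannot be used in the SELECT list like this), assuming the DataFrame is registered as a view named sample_data_schema, would be:
SELECT *
    , split(HDD, ' ') AS HDD_Array
FROM sample_data_schema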