Cheat sheet: Converting Spark SQL to AWS Athena SQL |
submit Data Collection and Data Pipeline Related Tutorials
Published Jul 24 '24. Last edited Dec 19 '24 |
For Data Engineers trying to debug #spark data pipelines using #AWS #Athena, the little check list below will very likely save 30 minutes to one hour of your effort.
For Data Engineers working with Spark data pipelines on AWS, it's a quite common task for convert Spark SQL to AWS Athena SQL because a common scenario is Spark SQL errors out while running Spark job on a cluster, at that point, next step developers take is to test run the Spark SQL statement in AWS Athena to debug it. AWS Athena is AWS wrapper of open-source Presto SQL execution engine. However, converting Spark SQL to Athena SQL is not straight-forward, as there are a number of syntaxes that are specific to AWS Athena SQL and Spark SQL respectively, a successful conversion will require developers to
Here is a list of these syntax differences (this list is work in progress, welcome to comment to add or contribute to it)
Intent | AWS Athena SQL | Databricks Spark SQL |
---|---|---|
string replacement | replace |
regexp_replace |
string trimming spaces | LTRIM and RTRIM |
trim |
get substring | substr |
substring |
value of a certain date | date_parse('yyyy-mm-dd', '%Y-%m-%d') |
'yyyy-mm-dd' |
value of today's date | current_date |
current_date() |
map function | map_agg (need GROUP BY at end of SQL query, reference1 and reference2 on StackOverflow) |
map (no GROUP BY is needed at end of query, function reference on Databricks) |
escaping column name with . (dot) |
using double quotes, e.g. "col.name" |
using backtick symbol, e.g. `col.name` (reference on StackOverflow) |
Terms of Use: You are in agreement with our Terms of Services and Privacy Policy. If you have any question or concern to any information published on SaveNowClub, please feel free to write to us at savenowclub@gmail.com