Cheat sheet: Converting Spark SQL to AWS Athena SQL


Published Jul 24 '24. Last edited Jul 09 '25

Tutorial   #aws #spark #athena  

For Data Engineers debugging #spark data pipelines with #AWS #Athena, which is built on the open-source Presto engine, the checklist below can easily save 30 minutes to an hour of effort.

Converting Spark SQL to AWS Athena SQL is a common task for Data Engineers running Spark pipelines on AWS. A typical scenario: a Spark SQL statement errors out while the Spark job runs on a cluster, and the next step developers take is to test-run that statement in AWS Athena to debug it. AWS Athena is AWS's wrapper around the open-source Presto SQL execution engine. However, the conversion is not straightforward, because a number of syntax constructs are specific to Athena SQL or Spark SQL respectively. A successful conversion requires developers to:

  • know where these syntax differences are
  • change the syntax so that it is accepted by AWS Athena

Here is a list of these syntax differences (this list is a work in progress; comments and contributions to it are welcome):

| Intent | AWS Athena (Presto) SQL | Databricks Spark SQL |
| --- | --- | --- |
| String replacement | `replace` | `regexp_replace` |
| Trim spaces from a string | `LTRIM` and `RTRIM` | `trim` |
| Get a substring | `substr` | `substring` |
| Convert string type to date type | `date_parse('yyyy-mm-dd', '%Y-%m-%d')` | a string in `yyyy-mm-dd` format can be used directly as a date |
| Value of today's date | `current_date` | `current_date()` |
| Convert date to string in a given format | `date_format(current_timestamp, '%Y_%m_%d')` per reference, where `%Y_%m_%d` formats a date as `YYYY_MM_DD` | `date_format(current_timestamp, 'yyyyMMdd')`, where `yyyyMMdd` formats a date as `YYYYMMDD` |
| Convert timestamp to date | a timestamp can be compared directly against a date; no conversion needed | `to_date(ts)` (reference) |
| Pivot rows into columns | `map_agg` (requires `GROUP BY` at the end of the query; reference1 and reference2 on StackOverflow) | `map` (no `GROUP BY` needed at the end of the query; function reference on Databricks) |
| Escape a column or table name containing `.` (dot) | double quotes, e.g. `"col.name"` or `"table.name"` | backticks, e.g. `` `col.name` `` or `` `table.name` `` (reference on StackOverflow) |
| Find differences between 2 sets | does not support `MINUS`, but supports `EXCEPT` with almost identical functionality | supports both `MINUS` and `EXCEPT` |
| Concatenate strings with a separator | supports `CONCAT_WS` per AWS doc | same |
| Get length of an array | `cardinality` | `size` |
| Aggregate values in a column into groups | `array_agg` (reference) | `collect_set` |
| Array index | starts at 1 | starts at 0 |
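To make the string-function rows concrete, here is a sketch of the same cleanup query in both dialects (the table name `products` and column `name` are made up for illustration):

```sql
-- Athena (Presto) SQL: replace takes a literal search string;
-- LTRIM/RTRIM strip leading/trailing spaces
SELECT replace(name, '-', '_') AS clean_name,
       LTRIM(RTRIM(name))     AS trimmed_name
FROM products;

-- Spark SQL: regexp_replace takes a regular expression;
-- trim strips spaces from both ends
SELECT regexp_replace(name, '-', '_') AS clean_name,
       trim(name)                     AS trimmed_name
FROM products;
```

Note that `regexp_replace` treats its second argument as a regex, so characters like `.` or `*` must be escaped when a literal replacement is intended.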
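The date-handling rows are where conversions most often break. A hedged example of the same filter-and-format query in both dialects (the table `orders` and timestamp column `order_ts` are hypothetical):

```sql
-- Spark SQL: a 'yyyy-MM-dd' string literal compares against a date,
-- and date_format uses Java-style patterns like 'yyyyMMdd'
SELECT date_format(order_ts, 'yyyyMMdd') AS day_key
FROM orders
WHERE to_date(order_ts) >= '2024-01-01';

-- Athena (Presto) SQL: parse the string with date_parse, use
-- MySQL-style % patterns in date_format; a timestamp compares
-- directly against the parsed value, so no to_date is needed
SELECT date_format(order_ts, '%Y%m%d') AS day_key
FROM orders
WHERE order_ts >= date_parse('2024-01-01', '%Y-%m-%d');
```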
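For the pivot row, the two functions work quite differently, which is why the `GROUP BY` requirement differs. A sketch with made-up tables (`measurements` in key/value layout for Athena, `measurements_wide` in column layout for Spark):

```sql
-- Athena (Presto) SQL: map_agg is an aggregate function that
-- pivots key/value rows into one map per group, so GROUP BY is required
SELECT id,
       map_agg(metric_name, metric_value) AS metrics
FROM measurements
GROUP BY id;

-- Spark SQL: map builds a map per row from explicit key/value pairs,
-- so no GROUP BY is needed
SELECT id,
       map('height', height, 'weight', weight) AS metrics
FROM measurements_wide;
```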
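The last two array rows combine into a particularly easy-to-miss pair of changes, since an off-by-one index fails silently rather than erroring out. An illustrative query (the `items` table and `tags` array column are assumed):

```sql
-- Athena (Presto) SQL: cardinality for length; bracket indexing is 1-based,
-- so tags[1] is the first element
SELECT cardinality(tags) AS n_tags,
       tags[1]           AS first_tag
FROM items;

-- Spark SQL: size for length; bracket indexing is 0-based,
-- so tags[0] is the first element
SELECT size(tags) AS n_tags,
       tags[0]    AS first_tag
FROM items;
```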

 
