We use string manipulation functions quite extensively. Here are some of the important functions which we typically use.
Case Conversion - lower
, upper
, initcap
Getting size of the column value - length
Extracting Data - substr
and split_part
Trimming and Padding functions - trim
, rtrim
, ltrim
, rpad
and lpad
Reversing strings - reverse
Concatenating multiple strings concat
and concat_ws
Let us understand how to perform case conversion of a string and also get length of a string.
Case Conversion Functions - lower
, upper
, initcap
Getting length - length
Let us see how to use these functions on top of the table. We will use orders table which was loaded as part of last section.
order_status for all the orders is in upper case and we will convert every thing to lower case.
order_id | order_date | order_customer_id | order_status |
---|---|---|---|
1021 | 2013-07-30 00:00:00 | 10118 | COMPLETE |
4068 | 2013-08-17 00:00:00 | 12293 | PENDING |
5881 | 2013-08-30 00:00:00 | 3715 | CLOSED |
7564 | 2013-09-09 00:00:00 | 8648 | CLOSED |
8766 | 2013-09-18 00:00:00 | 855 | COMPLETE |
8926 | 2013-09-19 00:00:00 | 10517 | ON_HOLD |
9290 | 2013-09-21 00:00:00 | 11879 | COMPLETE |
9793 | 2013-09-24 00:00:00 | 9809 | COMPLETE |
9816 | 2013-09-24 00:00:00 | 1753 | COMPLETE |
14047 | 2013-10-20 00:00:00 | 6473 | CLOSED |
order_id | order_date | order_customer_id | order_status | order_status_length |
---|---|---|---|---|
1021 | 2013-07-30 00:00:00 | 10118 | complete | 8 |
4068 | 2013-08-17 00:00:00 | 12293 | pending | 7 |
5881 | 2013-08-30 00:00:00 | 3715 | closed | 6 |
7564 | 2013-09-09 00:00:00 | 8648 | closed | 6 |
8766 | 2013-09-18 00:00:00 | 855 | complete | 8 |
8926 | 2013-09-19 00:00:00 | 10517 | on_hold | 7 |
9290 | 2013-09-21 00:00:00 | 11879 | complete | 8 |
9793 | 2013-09-24 00:00:00 | 9809 | complete | 8 |
9816 | 2013-09-24 00:00:00 | 1753 | complete | 8 |
14047 | 2013-10-20 00:00:00 | 6473 | closed | 6 |
Let us understand how to extract data from strings using substr
/substring
as well as split_part
.
We can extract sub string from main string using substr
or substring
position and length.
For example, get first 4 characters from date to get year or get last 4 characters from fixed length unique id.
substring
have broader options (regular expression) and also can be used with different styles (using keywords such as FROM
, FOR
).
Unlike in other relational databases, we cannot pass negative integers to substr
or substring
to get the information from right. We need to use functions like right
instead.
%%sql
WITH unique_ids AS (
SELECT '241-80-7115' AS unique_id UNION
SELECT '694-30-6851' UNION
SELECT '586-92-5361' UNION
SELECT '884-65-284' UNION
SELECT '876-99-585' UNION
SELECT '831-59-5593' UNION
SELECT '399-88-3617' UNION
SELECT '733-17-4217' UNION
SELECT '873-68-9778' UNION
SELECT '48'
) SELECT unique_id,
substring(unique_id FROM 1 FOR 3) AS unique_id_first3,
substring(unique_id FROM '....$') AS unique_id_last4
FROM unique_ids
ORDER BY unique_id
unique_id | unique_id_first3 | unique_id_last4 |
---|---|---|
241-80-7115 | 241 | 7115 |
399-88-3617 | 399 | 3617 |
48 | 48 | None |
586-92-5361 | 586 | 5361 |
694-30-6851 | 694 | 6851 |
733-17-4217 | 733 | 4217 |
831-59-5593 | 831 | 5593 |
873-68-9778 | 873 | 9778 |
876-99-585 | 876 | -585 |
884-65-284 | 884 | -284 |
Let us see how we can extract date part from order_date of orders.
order_id | order_date | order_customer_id | order_status |
---|---|---|---|
1021 | 2013-07-30 00:00:00 | 10118 | COMPLETE |
4068 | 2013-08-17 00:00:00 | 12293 | PENDING |
5881 | 2013-08-30 00:00:00 | 3715 | CLOSED |
7564 | 2013-09-09 00:00:00 | 8648 | CLOSED |
8766 | 2013-09-18 00:00:00 | 855 | COMPLETE |
8926 | 2013-09-19 00:00:00 | 10517 | ON_HOLD |
9290 | 2013-09-21 00:00:00 | 11879 | COMPLETE |
9793 | 2013-09-24 00:00:00 | 9809 | COMPLETE |
9816 | 2013-09-24 00:00:00 | 1753 | COMPLETE |
14047 | 2013-10-20 00:00:00 | 6473 | CLOSED |
order_id | order_date | order_customer_id | order_status |
---|---|---|---|
1021 | 2013-07-30 | 10118 | COMPLETE |
4068 | 2013-08-17 | 12293 | PENDING |
5881 | 2013-08-30 | 3715 | CLOSED |
7564 | 2013-09-09 | 8648 | CLOSED |
8766 | 2013-09-18 | 855 | COMPLETE |
8926 | 2013-09-19 | 10517 | ON_HOLD |
9290 | 2013-09-21 | 11879 | COMPLETE |
9793 | 2013-09-24 | 9809 | COMPLETE |
9816 | 2013-09-24 | 1753 | COMPLETE |
14047 | 2013-10-20 | 6473 | CLOSED |
Let us understand how to extract the information from the string where there is a delimiter.
split_part
can be used to split a string using delimiter and extract the information.
If there is no data in a given position after splitting, it will be represented as empty string ‘’.
%%sql
WITH addresses AS (
SELECT '593 Fair Oaks Pass, Frankfort, Kentucky, 40618' AS address UNION
SELECT ', Vancouver, Washington, 98687' UNION
SELECT '83047 Glacier Hill Circle, Sacramento, California, 94237' UNION
SELECT '935 Columbus Junction, Cincinnati, Ohio, 45213' UNION
SELECT '03010 Nevada Crossing, El Paso, Texas, 88579' UNION
SELECT '9 Dunning Circle, , Arizona, 85271' UNION
SELECT '96 Fair Oaks Way, Decatur, Illinois, 62525' UNION
SELECT '999 Caliangt Avenue, Greenville, South Carolina, 29615' UNION
SELECT '2 Saint Paul Trail, Bridgeport, , 06673' UNION
SELECT '3 Reindahl Center, Ogden, Utah'
) SELECT split_part(address, ', ', 1) street,
split_part(address, ', ', 2) city,
split_part(address, ', ', 3) state,
split_part(address, ', ', 4) postal_code
FROM addresses
ORDER BY postal_code
street | city | state | postal_code |
---|---|---|---|
3 Reindahl Center | Ogden | Utah | |
2 Saint Paul Trail | Bridgeport | 06673 | |
999 Caliangt Avenue | Greenville | South Carolina | 29615 |
593 Fair Oaks Pass | Frankfort | Kentucky | 40618 |
935 Columbus Junction | Cincinnati | Ohio | 45213 |
96 Fair Oaks Way | Decatur | Illinois | 62525 |
9 Dunning Circle | Arizona | 85271 | |
03010 Nevada Crossing | El Paso | Texas | 88579 |
83047 Glacier Hill Circle | Sacramento | California | 94237 |
Vancouver | Washington | 98687 |
%%sql
WITH addresses AS (
SELECT '593 Fair Oaks Pass, Frankfort, Kentucky, 40618' AS address UNION
SELECT ', Vancouver, Washington, 98687' UNION
SELECT '83047 Glacier Hill Circle, Sacramento, California, 94237' UNION
SELECT '935 Columbus Junction, Cincinnati, Ohio, 45213' UNION
SELECT '03010 Nevada Crossing, El Paso, Texas, 88579' UNION
SELECT '9 Dunning Circle, , Arizona, 85271' UNION
SELECT '96 Fair Oaks Way, Decatur, Illinois, 62525' UNION
SELECT '999 Caliangt Avenue, Greenville, South Carolina, 29615' UNION
SELECT '2 Saint Paul Trail, Bridgeport, , 06673' UNION
SELECT '3 Reindahl Center, Ogden, Utah'
) SELECT split_part(address, ', ', 1) street,
split_part(address, ', ', 2) city,
split_part(address, ', ', 3) state,
split_part(address, ', ', 4) postal_code
FROM addresses
WHERE split_part(address, ', ', 1) = ''
ORDER BY postal_code
%%sql
WITH unique_ids AS (
SELECT '241-80-7115' AS unique_id UNION
SELECT '694-30-6851' UNION
SELECT '586-92-5361' UNION
SELECT '884-65-284' UNION
SELECT '876-99-585' UNION
SELECT '831-59-5593' UNION
SELECT '399-88-3617' UNION
SELECT '733-17-4217' UNION
SELECT '873-68-9778' UNION
SELECT '480-69-032'
) SELECT unique_id,
substring(unique_id FROM 1 FOR 3) AS unique_id_first3,
substring(unique_id FROM '....$') AS unique_id_last4,
CASE WHEN length(split_part(unique_id, '-', 3)) = 4
THEN split_part(unique_id, '-', 3)
ELSE 'Invalid'
END AS unique_id_last
FROM unique_ids
ORDER BY unique_id
unique_id | unique_id_first3 | unique_id_last4 | unique_id_last |
---|---|---|---|
241-80-7115 | 241 | 7115 | 7115 |
399-88-3617 | 399 | 3617 | 3617 |
480-69-032 | 480 | -032 | Invalid |
586-92-5361 | 586 | 5361 | 5361 |
694-30-6851 | 694 | 6851 | 6851 |
733-17-4217 | 733 | 4217 | 4217 |
831-59-5593 | 831 | 5593 | 5593 |
873-68-9778 | 873 | 9778 | 9778 |
876-99-585 | 876 | -585 | Invalid |
884-65-284 | 884 | -284 | Invalid |
At times we might want to get the position of a substring in a main string. For example, we might want to check whether email ids have @ in them. We can use functions such as position
or strpos
.
%%sql
WITH email_ids AS (
SELECT 'bsellan0@yellowbook.com' AS email_id UNION
SELECT 'rstelljes1@illinois.edu' UNION
SELECT 'mmalarkey2@webeden.co.uk' UNION
SELECT 'emussared3@redcross.org' UNION
SELECT 'livashin4@bloglovin.com' UNION
SELECT 'gkeach5@cbc.ca' UNION
SELECT 'emasham6@xing.com' UNION
SELECT 'rcobbald7@house.gov' UNION
SELECT 'rdrohan8@washingtonpost.com' UNION
SELECT 'aebben9@arstechnica.com'
)
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
(psycopg2.errors.SyntaxError) syntax error at end of input
LINE 12: )
^
[SQL: WITH email_ids AS (
SELECT 'bsellan0@yellowbook.com' AS email_id UNION
SELECT 'rstelljes1@illinois.edu' UNION
SELECT 'mmalarkey2@webeden.co.uk' UNION
SELECT 'emussared3@redcross.org' UNION
SELECT 'livashin4@bloglovin.com' UNION
SELECT 'gkeach5@cbc.ca' UNION
SELECT 'emasham6@xing.com' UNION
SELECT 'rcobbald7@house.gov' UNION
SELECT 'rdrohan8@washingtonpost.com' UNION
SELECT 'aebben9@arstechnica.com'
)]
(Background on this error at: http://sqlalche.me/e/13/f405)
Let us understand how to trim or remove leading and/or trailing spaces in a string.
ltrim
is used to remove the spaces on the left side of the string.
rtrim
is used to remove the spaces on the right side of the string.
trim
is used to remove the spaces on both sides of the string.
Let us understand how to use padding to pad characters to a string.
Let us assume that there are 3 fields - year, month and date which are of type integer.
If we have to concatenate all the 3 fields and create a date, we might have to pad month and date with 0.
lpad
is used more often than rpad
especially when we try to build the date from separate columns.
Let us understand how to reverse a string as well as concatenate multiple strings.
We can use reverse
to reverse a string.
We can concatenate multiple strings using concat
and concat_ws
.
concat_ws
is typically used if we want to have the same string between all the strings that are being concatenated.
result |
---|
Order Status is COMPLETE |
Order Status is PENDING |
Order Status is CLOSED |
Order Status is CLOSED |
Order Status is COMPLETE |
Order Status is ON_HOLD |
Order Status is COMPLETE |
Order Status is COMPLETE |
Order Status is COMPLETE |
Order Status is CLOSED |
Let us go through the details related to string replacement.
replace
can be used to replace a sub string with in a string with another string.
overlay
can be used to replace a sub string with in a string by position with another string.
translate
can be used to replace individual characters with other characters.
%%sql
WITH unique_ids AS (
SELECT '241-80-7115' AS unique_id UNION
SELECT '694-30-6851' UNION
SELECT '586-92-5361' UNION
SELECT '884-65-2844' UNION
SELECT '876-99-5856' UNION
SELECT '831-59-5593' UNION
SELECT '399-88-3617' UNION
SELECT '733-17-4217' UNION
SELECT '873-68-9778' UNION
SELECT '487-21-9802'
) SELECT unique_id,
replace(unique_id, '-', ' ') AS unique_id_replaced,
translate(unique_id, '-', ' ') AS unique_id_translated,
overlay(unique_id PLACING ' ' FROM 4 FOR 1) AS unique_id_overlaid
FROM unique_ids
ORDER BY unique_id
unique_id | unique_id_replaced | unique_id_translated | unique_id_overlaid |
---|---|---|---|
241-80-7115 | 241 80 7115 | 241 80 7115 | 241 80-7115 |
399-88-3617 | 399 88 3617 | 399 88 3617 | 399 88-3617 |
487-21-9802 | 487 21 9802 | 487 21 9802 | 487 21-9802 |
586-92-5361 | 586 92 5361 | 586 92 5361 | 586 92-5361 |
694-30-6851 | 694 30 6851 | 694 30 6851 | 694 30-6851 |
733-17-4217 | 733 17 4217 | 733 17 4217 | 733 17-4217 |
831-59-5593 | 831 59 5593 | 831 59 5593 | 831 59-5593 |
873-68-9778 | 873 68 9778 | 873 68 9778 | 873 68-9778 |
876-99-5856 | 876 99 5856 | 876 99 5856 | 876 99-5856 |
884-65-2844 | 884 65 2844 | 884 65 2844 | 884 65-2844 |
%%sql
WITH unique_ids AS (
SELECT '241-80-7115' AS unique_id UNION
SELECT '694-30:6851' UNION
SELECT '586-92-5361' UNION
SELECT '884:65-2844' UNION
SELECT '876/99-5856' UNION
SELECT '831-59:5593' UNION
SELECT '399-88-3617' UNION
SELECT '733:17-4217' UNION
SELECT '873:68-9778' UNION
SELECT '487-21/9802'
) SELECT unique_id,
replace(replace(unique_id, '-', ' '), ':', ' ') AS unique_id_replaced,
translate(unique_id, '-:/', ' ') AS unique_id_translated,
overlay(overlay(unique_id PLACING ' ' FROM 4 FOR 1) PLACING ' ' FROM 7 FOR 1) AS unique_id_overlaid
FROM unique_ids
ORDER BY unique_id
unique_id | unique_id_replaced | unique_id_translated | unique_id_overlaid |
---|---|---|---|
241-80-7115 | 241 80 7115 | 241 80 7115 | 241 80 7115 |
399-88-3617 | 399 88 3617 | 399 88 3617 | 399 88 3617 |
487-21/9802 | 487 21/9802 | 487 21 9802 | 487 21 9802 |
586-92-5361 | 586 92 5361 | 586 92 5361 | 586 92 5361 |
694-30:6851 | 694 30 6851 | 694 30 6851 | 694 30 6851 |
733:17-4217 | 733 17 4217 | 733 17 4217 | 733 17 4217 |
831-59:5593 | 831 59 5593 | 831 59 5593 | 831 59 5593 |
873:68-9778 | 873 68 9778 | 873 68 9778 | 873 68 9778 |
876/99-5856 | 876/99 5856 | 876 99 5856 | 876 99 5856 |
884:65-2844 | 884 65 2844 | 884 65 2844 | 884 65 2844 |
%%sql
WITH phone_numbers AS (
SELECT '+86 (238) 954-9649' AS phone_number UNION
SELECT '+420 (331) 900-5807' UNION
SELECT '+1 (320) 484-4495' UNION
SELECT '+45 (238) 961-9801' UNION
SELECT '+51 (123) 545-6543' UNION
SELECT '+63 (308) 354-2560' UNION
SELECT '+86 (433) 851-1260' UNION
SELECT '+63 (332) 705-0319' UNION
SELECT '+351 (147) 359-3767' UNION
SELECT '+57 (714) 557-0468'
) SELECT phone_number,
translate(phone_number, '+() -', '') phone_number_int
FROM phone_numbers
ORDER BY phone_number
phone_number | phone_number_int |
---|---|
+1 (320) 484-4495 | 13204844495 |
+351 (147) 359-3767 | 3511473593767 |
+420 (331) 900-5807 | 4203319005807 |
+45 (238) 961-9801 | 452389619801 |
+51 (123) 545-6543 | 511235456543 |
+57 (714) 557-0468 | 577145570468 |
+63 (308) 354-2560 | 633083542560 |
+63 (332) 705-0319 | 633327050319 |
+86 (238) 954-9649 | 862389549649 |
+86 (433) 851-1260 | 864338511260 |