Exploratory data analysis in SQL

Strategies for multiple transformations

To select_part() with a varying delimiter - use CASE WHEN
Creating new temporary tables from the messy tables:

(1) 
CREATE TEMP TABLE recode AS
SELECT DISTINCT messy_column AS original
messy_column AS standardized
FROM original_table;

(2)
UPDATE recode
SET standardized = new_value
WHERE conditions

(3)
Joining the temporary table with the original table

Splitting and concatenating text

SELECT substring(string FROM START FOR length) SELECT split_part('a,bc,d', delimiter, part) returns the requested part after splitting based on the delimiter SELECT concat('a', 2, 'b') return a2b; SELECT column1||column2||column3; concat() skips the NULL, whill || returns nothing if any NULL appears

Converting cases

upper/lower('text') - changes case, removes inconsistencies in cases, does not influence punctuation

WHERE column ILIKE '%text%' - case insensitive (I) comparison of 'text', where any additional symbols before and after 'text' are included

SELECT trim('text', 'text_to_remove') - a good way to truncate text; SELECT trim(lower('text'), 'text_to_remove') - no need to take into account capitalization;

Character data types

Types: character (n) or char(n) character varying (n) or varchar(n) text or varchar

Types of text: categorical - standartized values like days of weeks, codes unstructured - text, set of uncontrolled uniqie values

SELECT category, count(*) FROM table GROUP by category ORDER BY count/ORDER BY category- counting and ordering the categories ordering by letters - alphabetical; spaces come before capitals before lowercases

Issues - spaces are not the same as empty strings; empty strings are not NULL; punctuation matters;

Creating temporary tables

CREATE TEMP TABLE tablename AS SELECT column1, column2, ... FROM table

SELECT column1, column2 ... INTO TEMP TABLE tablename FROM table

INSERT INTO tablename SELECT column1, column2, ... FROM table

DROP TABLE tablename - delets the tables DROP TABLE IF EXISTS tablename - to avoid errors

Exploring numerical data

SELECT corr(column1, column2) - correlation between columns (-1, 1) SELECT percentile_disc/percentile_cont(percentile) WITHIN GROUP (ORDER BY(column)) - if percentile = 0.5 - calculates discrete/continous median; continuous might return the value that is not inside the column Issues - error numbers - 9, 99, -99; Non-existing - 0, NaN, N/A; outliers; non-numeric data

Exploring distributions:

counting a number of occurences of a value, group by and order by value sometimes truncation is useful - trunc(number, digits); positive digits - to remove values after the decimal point to leave the number after the decimal point equal to digit; negative digits - used before decimal point to replace the number of values by zero, the number is equal to digits

generate_series(start, end, step) = np.arange

Numeric data types and summary functions

integer: int, smallint, bigint - 4 bytes, 2 bytes, 8 bytes serial, smallserial, bigserial - used to generate unique value (like id) when a new entry is added to the column

decimal: decimal/numeric - variable precision real - 4 bytes - 6 deciaml digits precision double precision - 8 bytes 15 decimal digits precision

FUNCTIONS: min, max, avg var_pop - population variance (divided by n) var_samp - sample variance (divided by n-1), variance() - alias for var_samp stddev(), round

Column constaints

Foreign key - value that exists in a reference column or NULL Primary key - unique, not NULL Unique - al values are different or NULL NOT NULL - anything but NULL

Data types - common: Numeric Character Date/time Boolean

special: array, monetary, binary, geometric etc.

CAST() - casting the datatype to another type only for current querry

What's in the database:

count(*) - number of rows in the table count(column) - number of non NULL values count(DISTINCT column) - different non-NULL values in the column SELECT DISTINCT column - selects all different values INCLUDING NULL values

Foreign keys and foreign key relationships Basically a there is a column of unique ids - it is a primary key, these ids appear in different columns maybe in different tables and they are not necessarily unique - these are foreign keys. The column is a foreign key referencing another column if only this another column has unique values. Otherwise the referencing in ambiguous

references another raw in a different table or the same table with a unique ID, they are stored in a primary key column
Values restricted to values in referenced column or NULL coalesce(column1, column2, ...) - returns the first value that is not NULL

Arrow id:id - notation for foreign keys Primary keys column is shown by a red arrow

# Start coding here...