Strategies for multiple transformations
To select_part() with a varying delimiter - use CASE WHEN
Creating new temporary tables from the messy tables:
(1)
CREATE TEMP TABLE recode AS
SELECT DISTINCT messy_column AS original
messy_column AS standardized
FROM original_table;
(2)
UPDATE recode
SET standardized = new_value
WHERE conditions
(3)
Joining the temporary table with the original table
Splitting and concatenating text
SELECT substring(string FROM START FOR length) SELECT split_part('a,bc,d', delimiter, part) returns the requested part after splitting based on the delimiter SELECT concat('a', 2, 'b') return a2b; SELECT column1||column2||column3; concat() skips the NULL, whill || returns nothing if any NULL appears
Converting cases
upper/lower('text') - changes case, removes inconsistencies in cases, does not influence punctuation
WHERE column ILIKE '%text%' - case insensitive (I) comparison of 'text', where any additional symbols before and after 'text' are included
SELECT trim('text', 'text_to_remove') - a good way to truncate text; SELECT trim(lower('text'), 'text_to_remove') - no need to take into account capitalization;
Character data types
Types: character (n) or char(n) character varying (n) or varchar(n) text or varchar
Types of text: categorical - standartized values like days of weeks, codes unstructured - text, set of uncontrolled uniqie values
SELECT category, count(*) FROM table GROUP by category ORDER BY count/ORDER BY category- counting and ordering the categories ordering by letters - alphabetical; spaces come before capitals before lowercases
Issues - spaces are not the same as empty strings; empty strings are not NULL; punctuation matters;
Creating temporary tables
CREATE TEMP TABLE tablename AS SELECT column1, column2, ... FROM table
or
SELECT column1, column2 ... INTO TEMP TABLE tablename FROM table
INSERT INTO tablename SELECT column1, column2, ... FROM table
DROP TABLE tablename - delets the tables DROP TABLE IF EXISTS tablename - to avoid errors
Exploring numerical data
SELECT corr(column1, column2) - correlation between columns (-1, 1) SELECT percentile_disc/percentile_cont(percentile) WITHIN GROUP (ORDER BY(column)) - if percentile = 0.5 - calculates discrete/continous median; continuous might return the value that is not inside the column Issues - error numbers - 9, 99, -99; Non-existing - 0, NaN, N/A; outliers; non-numeric data
Exploring distributions:
- counting a number of occurences of a value, group by and order by value sometimes truncation is useful - trunc(number, digits); positive digits - to remove values after the decimal point to leave the number after the decimal point equal to digit; negative digits - used before decimal point to replace the number of values by zero, the number is equal to digits
generate_series(start, end, step) = np.arange
Numeric data types and summary functions
integer: int, smallint, bigint - 4 bytes, 2 bytes, 8 bytes serial, smallserial, bigserial - used to generate unique value (like id) when a new entry is added to the column
decimal: decimal/numeric - variable precision real - 4 bytes - 6 deciaml digits precision double precision - 8 bytes 15 decimal digits precision
FUNCTIONS: min, max, avg var_pop - population variance (divided by n) var_samp - sample variance (divided by n-1), variance() - alias for var_samp stddev(), round
Column constaints
Foreign key - value that exists in a reference column or NULL Primary key - unique, not NULL Unique - al values are different or NULL NOT NULL - anything but NULL
Data types - common: Numeric Character Date/time Boolean
special: array, monetary, binary, geometric etc.
CAST() - casting the datatype to another type only for current querry
What's in the database:
count(*) - number of rows in the table count(column) - number of non NULL values count(DISTINCT column) - different non-NULL values in the column SELECT DISTINCT column - selects all different values INCLUDING NULL values
Foreign keys and foreign key relationships Basically a there is a column of unique ids - it is a primary key, these ids appear in different columns maybe in different tables and they are not necessarily unique - these are foreign keys. The column is a foreign key referencing another column if only this another column has unique values. Otherwise the referencing in ambiguous
- references another raw in a different table or the same table with a unique ID, they are stored in a primary key column
- Values restricted to values in referenced column or NULL coalesce(column1, column2, ...) - returns the first value that is not NULL
Arrow id:id - notation for foreign keys Primary keys column is shown by a red arrow
# Start coding here...