Introduction
|
|
Formatting Spreadsheets as Data Tables
|
A relational database lets you compute on multiple tables at once.
Never modify your raw data. Always make a copy before making any changes.
Keep track of all of the steps you take to clean your data.
Organize your data according to tidy data principles.
Record metadata in a separate plain text file.
|
Formatting Problems
|
Avoid using multiple tables within one spreadsheet.
Avoid spreading data across multiple tabs (but do use a new tab to record data cleaning or manipulations).
Record zeros as zeros.
Use an appropriate null value to record missing data.
Don’t use formatting to convey information or to make your spreadsheet look pretty.
Place comments in a separate column.
Record units in column headers.
Include only one piece of information in a cell.
Avoid spaces, numbers and special characters in column headers.
Avoid special characters in your data.
|
Dates as Data
|
|
Using DB Browser for SQLite
|
The DB Browser for SQLite application allows you to connect to an existing database or create a new database
When connected to a database you can create new tables
When connected to a database you can write and run SQL queries and view the results
You can save the results of a query to a file
|
Selecting Data
|
A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.
A database manager is a program that manipulates information stored in a database.
We write queries in a specialized language called SQL to extract information from databases.
Use SELECT… FROM… to get values from a database table.
SQL is case-insensitive (but data is case-sensitive).
|
Sorting and Removing Duplicates
|
The records in a database table are not intrinsically ordered: if we want to display them in some order, we must specify that explicitly with ORDER BY.
The values in a database are not guaranteed to be unique: if we want to eliminate duplicates, we must specify that explicitly as well using DISTINCT.
|
Filtering
|
Use WHERE to specify conditions that records must meet in order to be included in a query’s results.
Use AND, OR, and NOT to combine tests.
Filtering is done on whole records, so conditions can use fields that are not actually displayed.
Write queries incrementally.
|
Combining Data
|
Use JOIN to combine data from two tables.
Use table.field notation to refer to fields when doing joins.
Every fact should be represented in a database exactly once.
A join produces all combinations of records from one table with records from another.
A primary key is a field (or set of fields) whose values uniquely identify the records in a table.
A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.
We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.
The most common join condition is matching keys.
|
Calculating New Values
|
|
Missing Data
|
Databases use a special value called NULL to represent missing information.
Almost all operations on NULL produce NULL.
Queries can test for NULLs using IS NULL and IS NOT NULL.
|
Aggregation
|
Use aggregation functions to combine multiple values.
Aggregation functions ignore null values.
Aggregation happens after filtering.
Use GROUP BY to combine subsets separately.
If no aggregation function is specified for a field, the query may return an arbitrary value for that field.
|
Programming with Databases - Python
|
General-purpose languages have libraries for accessing databases.
To connect to a database, a program must use a library specific to that database manager.
These libraries use a connection-and-cursor model.
Programs can read query results in batches or all at once.
Queries should be written using parameter substitution, not string formatting.
|
Creating and Modifying Data
|
Use CREATE and DROP to create and delete tables.
Use INSERT to add data.
Use UPDATE to modify existing data.
Use DELETE to remove data.
It is simpler and safer to modify data when every record has a unique primary key.
Do not create dangling references by deleting records that other records refer to.
|