Chapter 7 Data Inspection Features
7.1 Project 2.2: Validating cardinal domains — measures, counts, and durations
7.1.1 Description
7.1.2 Approach
7.1.3 Deliverables
7.2 Project 2.3: Validating text and codes — nominal data and ordinal numbers
7.2.1 Description
7.2.2 Approach
7.2.3 Deliverables
7.3 Project 2.4: Finding reference domains
7.3.1 Description
7.3.2 Approach
7.3.3 Deliverables
7.4 Summary
7.5 Extras
7.5.1 Markdown cells with dates and data source information
7.5.2 Presentation materials
7.5.3 JupyterBook or Quarto for even more sophisticated output
Chapter 8 Project 2.5: Schema and Metadata
8.1 Description
8.2 Approach
8.2.1 Define Pydantic classes and emit the JSON Schema
8.2.2 Define expected data domains in JSON Schema notation
8.2.3 Use JSON Schema to validate intermediate files
8.3 Deliverables
8.3.1 Schema acceptance tests
8.3.2 Extended acceptance testing
8.4 Summary
8.5 Extras
8.5.1 Revise all previous chapter models to use Pydantic
8.5.2 Use the ORM layer
Chapter 9 Project 3.1: Data Cleaning Base Application
9.1 Description
9.1.1 User experience
9.1.2 Source data
9.1.3 Result data
9.1.4 Conversions and processing
9.1.5 Error reports
9.2 Approach
9.2.1 Model module refactoring
9.2.2 Pydantic V2 validation
9.2.3 Validation function design
9.2.4 Incremental design
9.2.5 CLI application
9.3 Deliverables
9.3.1 Acceptance tests
9.3.2 Unit tests for the model features
9.3.3 Application to clean data and create an NDJSON interim file
9.4 Summary
9.5 Extras
9.5.1 Create an output file with rejected samples
Chapter 10 Data Cleaning Features
10.1 Project 3.2: Validate and convert source fields
10.1.1 Description
10.1.2 Approach
10.1.3 Deliverables
10.2 Project 3.3: Validate text fields (and numeric coded fields)
10.2.1 Description
10.2.2 Approach
10.2.3 Deliverables
10.3 Project 3.4: Validate references among separate data sources
10.3.1 Description
10.3.2 Approach
10.3.3 Deliverables
10.4 Project 3.5: Standardize data to common codes and ranges
10.4.1 Description
10.4.2 Approach
10.4.3 Deliverables
10.5 Project 3.6: Integration to create an acquisition pipeline
10.5.1 Description
10.5.2 Approach
10.5.3 Deliverables
10.6 Summary
10.7 Extras
10.7.1 Hypothesis testing
10.7.2 Rejecting bad data via filtering (instead of logging)
10.7.3 Disjoint subentities
10.7.4 Create a fan-out cleaning pipeline
Chapter 11 Project 3.7: Interim Data Persistence
11.1 Description
11.2 Overall approach
11.2.1 Designing idempotent operations
11.3 Deliverables
11.3.1 Unit test
11.3.2 Acceptance test
11.3.3 Cleaned-up re-runnable application design
11.4 Summary
11.5 Extras
11.5.1 Using a SQL database
11.5.2 Persistence with NoSQL databases