9.3.3 Application to clean data and create an NDJSON interim file
9.4 Summary
9.5 Extras
9.5.1 Create an output file with rejected samples
Chapter 10 Data Cleaning Features
10.1 Project 3.2: Validate and convert source fields
10.1.1 Description
10.1.2 Approach
10.1.3 Deliverables
10.2 Project 3.3: Validate text fields (and numeric coded fields)
10.2.1 Description
10.2.2 Approach
10.2.3 Deliverables
10.3 Project 3.4: Validate references among separate data sources
10.3.1 Description
10.3.2 Approach
10.3.3 Deliverables
10.4 Project 3.5: Standardize data to common codes and ranges
10.4.1 Description
10.4.2 Approach
10.4.3 Deliverables
10.5 Project 3.6: Integration to create an acquisition pipeline
10.5.1 Description
10.5.2 Approach
10.5.3 Deliverables
10.6 Summary
10.7 Extras
10.7.1 Hypothesis testing
10.7.2 Rejecting bad data via filtering (instead of logging)
10.7.3 Disjoint subentities
10.7.4 Create a fan-out cleaning pipeline
Chapter 11 Project 3.7: Interim Data Persistence
11.1 Description
11.2 Overall approach
11.2.1 Designing idempotent operations
11.3 Deliverables
11.3.1 Unit test
11.3.2 Acceptance test
11.3.3 Cleaned up re-runnable application design