Is this Data Analysis in Python?

Chapter 7 Data Inspection Features

7.1 Project 2.2: Validating cardinal domains — measures, counts, and durations

7.1.1 Description

7.1.2 Approach

7.1.3 Deliverables

7.2 Project 2.3: Validating text and codes — nominal data and ordinal numbers

7.2.1 Description

7.2.2 Approach

7.2.3 Deliverables

7.3 Project 2.4: Finding reference domains

7.3.1 Description

7.3.2 Approach

7.3.3 Deliverables

7.4 Summary

7.5 Extras

7.5.1 Markdown cells with dates and data source information

7.5.2 Presentation materials

7.5.3 JupyterBook or Quarto for even more sophisticated output

Chapter 8 Project 2.5: Schema and Metadata

8.1 Description

8.2 Approach

8.2.1 Define Pydantic classes and emit the JSON Schema

8.2.2 Define expected data domains in JSON Schema notation

8.2.3 Use JSON Schema to validate intermediate files

8.3 Deliverables

8.3.1 Schema acceptance tests

8.3.2 Extended acceptance testing

8.4 Summary

8.5 Extras

8.5.1 Revise all previous chapter models to use Pydantic

8.5.2 Use the ORM layer

Chapter 9 Project 3.1: Data Cleaning Base Application

9.1 Description

9.1.1 User experience

9.1.2 Source data

9.1.3 Result data

9.1.4 Conversions and processing

9.1.5 Error reports

9.2 Approach

9.2.1 Model module refactoring

9.2.2 Pydantic V2 validation

9.2.3 Validation function design

9.2.4 Incremental design

9.2.5 CLI application

9.3 Deliverables

9.3.1 Acceptance tests

9.3.2 Unit tests for the model features

9.3.3 Application to clean data and create an NDJSON interim file

9.4 Summary

9.5 Extras

9.5.1 Create an output file with rejected samples

Chapter 10 Data Cleaning Features

10.1 Project 3.2: Validate and convert source fields

10.1.1 Description

10.1.2 Approach

10.1.3 Deliverables

10.2 Project 3.3: Validate text fields (and numeric coded fields)

10.2.1 Description

10.2.2 Approach

10.2.3 Deliverables

10.3 Project 3.4: Validate references among separate data sources

10.3.1 Description

10.3.2 Approach

10.3.3 Deliverables

10.4 Project 3.5: Standardize data to common codes and ranges

10.4.1 Description

10.4.2 Approach

10.4.3 Deliverables

10.5 Project 3.6: Integration to create an acquisition pipeline

10.5.1 Description

10.5.2 Approach

10.5.3 Deliverables

10.6 Summary

10.7 Extras

10.7.1 Hypothesis testing

10.7.2 Rejecting bad data via filtering (instead of logging)

10.7.3 Disjoint subentities

10.7.4 Create a fan-out cleaning pipeline

Chapter 11 Project 3.7: Interim Data Persistence

11.1 Description

11.2 Overall approach

11.2.1 Designing idempotent operations

11.3 Deliverables

11.3.1 Unit test

11.3.2 Acceptance test

11.3.3 Cleaned up re-runnable application design

11.4 Summary

11.5 Extras

11.5.1 Using a SQL database

11.5.2 Persistence with NoSQL databases

Technology, programming, computer science, programming language, Python