How to Plan and Manage Your Machine Learning Project
Project layout is critical for machine learning projects just as it is for software development projects. I think of it like language. A project layout organizes thoughts and gives you context for ideas just like knowing the names for things gives you the basis for thinking.
In this post I want to highlight some considerations in the layout and management of your machine learning project. This is closely related to the goals of project and scientific reproducibility. There is no “best” way; you will want to select and adopt the practices that best suit your preferences and project requirements.
Workflow Motivating Questions
Jeromy Anglim gave a presentation at the Melbourne R Users group in 2010 on the state of project layout for R. The video is a bit shaky but provides a good discussion on the topic.
I really like the motivating questions from Jeromy’s presentation, each of which asks how to:
- Divide a project into files and folders?
- Incorporate R analyses into a report?
- Convert default R output into publication quality tables, figures, and text?
- Build the final product?
- Sequence the analyses?
- Divide code into functions?
You can check out the summary of the presentation on Jeromy’s blog, the PDF presentation slides and the YouTube video of the presentation.
Goals for Project Workflow
David Smith provides a summary of what he believes are the goals of a good project workflow in the post titled A workflow for R. I think these are excellent and should be kept in mind when designing your own project layout.
- Transparency: Logical and clear layout for the project making it intuitive for the reader.
- Maintainability: Easy to modify the project with standard names for files and directories.
- Modularity: Discrete tasks separated into separate scripts with a single responsibility.
- Portability: Easy to move a project to another system (relative paths and known dependencies).
- Reproducibility: Easily run and create the same artefacts by you in the future or another person.
- Efficiency: Less thought on meta project details like the tools and more on the problems you are solving.
ProjectTemplate
John Myles White has an R project called ProjectTemplate that aims to automatically create a well-defined layout for a statistical analysis project. It provides conventions and utilities for automatically loading and munging data.
The logo for ProjectTemplate, a project for laying out your R statistical analysis project.
The project layout is larger than I would prefer, but provides insight into a highly-structured way for organizing your project.
- cache: Preprocessed datasets that don’t need to be re-generated every time you perform an analysis.
- config: Configuration settings for the project.
- data: Raw data files.
- munge: Preprocessing data munging code, the outputs of which are put in cache.
- src: Statistical analysis scripts.
- diagnostics: Scripts to diagnose data sets for corruption or outliers.
- doc: Documentation written about the analysis.
- graphs: Graphs created from analysis.
- lib: Helper library functions but not the core statistical analysis.
- logs: Output of scripts and any automatic logging.
- profiling: Scripts to benchmark the timing of your code.
- reports: Output reports and content that might go into reports such as tables.
- tests: Unit tests and regression suite for your code.
- README: Notes that orient any newcomers to the project.
- TODO: List of future improvements and bug fixes you plan to make.
You can learn more on the ProjectTemplate homepage, the blog post on John’s website, the GitHub page for development and the CRAN page for distribution.
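To get a feel for the layout listed above, here is a minimal Python sketch that scaffolds the same folder names on disk. It is not how ProjectTemplate works internally (that is an R package with its own tooling); the function name create_layout and the choice to create README and TODO as empty files are my own assumptions for illustration.

```python
from pathlib import Path

# Directory names taken from the ProjectTemplate layout listed above.
DIRECTORIES = [
    "cache", "config", "data", "munge", "src", "diagnostics", "doc",
    "graphs", "lib", "logs", "profiling", "reports", "tests",
]
# Plain files kept at the top level of the project.
FILES = ["README", "TODO"]


def create_layout(root):
    """Create the directories and empty top-level files under `root`.

    Hypothetical helper. Safe to re-run: anything that already
    exists is left alone.
    """
    root = Path(root)
    for name in DIRECTORIES:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root
```

Calling create_layout("my_project") gives you an empty skeleton. Deleting the folders you don't need is an easy way to arrive at a smaller layout of your own.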
Data Management
Software Carpentry provides a short presentation titled Data Management. The approach to data management is inspired by an article by William Stafford Noble titled A Quick Guide to Organizing Computational Biology Projects.
The presentation describes the problems with maintaining multiple versions of data on disk or in version control. It notes that the main requirement is data archiving, and proposes an approach that uses dated directory names together with a metadata file for each data file, where the metadata files are themselves managed in version control. It’s an interesting approach.
You can review the video and slides for the presentation here.
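As a rough illustration of the dated-directory idea, the sketch below copies a data file into a folder named after the date and writes a small metadata file next to it. The .meta.json suffix, the metadata fields and the function name archive_data_file are assumptions of mine, not the format used in the presentation.

```python
import hashlib
import json
import shutil
from datetime import date
from pathlib import Path


def archive_data_file(source, data_root, description, today=None):
    """Copy `source` into data_root/YYYY-MM-DD/ and describe it in a metadata file.

    Hypothetical helper: the field names below are illustrative only.
    """
    source = Path(source)
    day = (today or date.today()).isoformat()
    target_dir = Path(data_root) / day
    target_dir.mkdir(parents=True, exist_ok=True)

    # Keep an untouched copy of the data under the dated directory.
    target = target_dir / source.name
    shutil.copy2(source, target)

    # The metadata is small plain text, so it suits version control
    # even when the data file itself is too large to commit.
    metadata = {
        "file": source.name,
        "archived_on": day,
        "description": description,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
        "size_bytes": target.stat().st_size,
    }
    metadata_path = target_dir / (source.name + ".meta.json")
    metadata_path.write_text(json.dumps(metadata, indent=2) + "\n")
    return metadata_path
```

The checksum is there so that you can later confirm the archived file has not changed since the metadata was written.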
Best Practices
There is a lot of discussion of best practices for project layout and code organization for data analysis projects on question and answer sites.
A good example is the question How to efficiently manage a statistical analysis project? which was turned into a community wiki describing the best practices. In summary, these practices are divided into the following sections:
- Data management: Use a directory structure, never modify raw data directly, check data consistency, use GNU make.
- Coding: Organize code into functional units, document everything, custom functions in a dedicated file.
- Analysis: Record your random seeds, separate parameters into config files, use multivariate plots.
- Versioning: Use version control, backup everything, use an issue tracker.
- Editing/Reporting: Combine code and reporting and use formal report generators.
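Two of the analysis practices above, recording random seeds and keeping parameters in config files, fit together naturally. The Python sketch below is one possible way to do it, assuming a JSON config file with seed and sample_size keys; those key names and both function names are hypothetical.

```python
import json
import random
from pathlib import Path


def load_config(path):
    """Read analysis parameters from a JSON config file."""
    return json.loads(Path(path).read_text())


def run_analysis(config):
    """Toy analysis: average a random sample drawn with the configured seed."""
    # A private generator seeded from the config keeps the run repeatable
    # and avoids touching the global random state.
    rng = random.Random(config["seed"])
    sample = [rng.random() for _ in range(config["sample_size"])]
    return {
        "seed": config["seed"],  # recorded alongside the result
        "sample_size": config["sample_size"],
        "mean": sum(sample) / len(sample),
    }
```

Because the seed lives in the config file and is copied into the result, anyone holding the result can see which seed to use to regenerate it.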
More Practices
With each project I try to refine my project layout. It’s hard because the projects vary in their data and aims, as do the languages and tools. I’ve tried versions written entirely in compiled code and versions written entirely in scripting languages. Some good tips I’ve discovered include:
- Stick to a POSIX filesystem layout (var, etc, bin, lib, and so on).
- Put all commands in scripts.
- Call all scripts from GNU make targets.
- Have make targets that create environment and download public datasets.
- Create recipes and let the infrastructure check and create any missing output products each run.
This last point is a game changer. It allows you to pipeline your workflow and define recipes with wild abandon for tasks like data analysis, preprocessing, model configuration, feature selection, etc. The framework knows how to execute recipes and creates results for you to review. I’ve talked about this approach before.
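To make the recipe idea concrete, here is a small Python sketch of the "check and create any missing output products" step. It is not my actual framework and it is far simpler than GNU make: it only checks whether an output file exists (no timestamps, no cycle detection), and the build function and the shape of the recipes dictionary are assumptions made for this example.

```python
from pathlib import Path


def build(target, recipes, log=None):
    """Create `target` if it is missing, building its missing inputs first.

    `recipes` maps an output path (as a string) to a pair
    (list_of_input_paths, action). The action receives the target
    as a Path and must write that file.
    Returns the list of targets that were actually built, in order.
    """
    log = [] if log is None else log
    target = Path(target)
    if target.exists():
        return log  # already present, nothing to do
    if str(target) not in recipes:
        raise FileNotFoundError(f"no file and no recipe for {target}")
    inputs, action = recipes[str(target)]
    for dependency in inputs:
        build(dependency, recipes, log)  # inputs are built before outputs
    action(target)
    log.append(str(target))
    return log
```

Adding a new experiment then means adding one more entry to recipes. Running build on the final report only does the work whose outputs are not already on disk.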