How to Plan and Manage Your Machine Learning Project

By Harrison · November 30, 2015

Project layout is critical for machine learning projects just as it is for software development projects. I think of it like language. A project layout organizes thoughts and gives you context for ideas just like knowing the names for things gives you the basis for thinking.

In this post I want to highlight some considerations in the layout and management of your machine learning project. This is closely related to the goals of project and scientific reproducibility. There is no “best” way; you will want to select and adopt the practices that best meet your predilections and project requirements.

Workflow Motivating Questions

Jeromy Anglim gave a presentation at the Melbourne R Users group in 2010 on the state of project layout for R. The video is a bit shaky but provides a good discussion on the topic.

I really like the motivating questions from Jeromy’s presentation. How should you:

Divide a project into files and folders?
Incorporate R analyses into a report?
Convert default R output into publication quality tables, figures, and text?
Build the final product?
Sequence the analyses?
Divide code into functions?

You can check out the summary of the presentation on Jeromy’s blog, the PDF presentation slides, and the YouTube video of the presentation.

Goals for Project Workflow

David Smith provides a summary of what he believes are the goals of a good project workflow in the post titled A workflow for R. I think these are excellent and should be kept in mind when designing your own project layout.

Transparency: Logical and clear layout for the project making it intuitive for the reader.
Maintainability: Easy to modify the project with standard names for files and directories.
Modularity: Discrete tasks separated into separate scripts with a single responsibility.
Portability: Easy to move a project to another system (relative paths and known dependencies).
Reproducibility: You in the future, or another person, can easily rerun the project and create the same artefacts.
Efficiency: Less thought on meta project details like the tools and more on the problems you are solving.

ProjectTemplate

John Myles White has an R project called ProjectTemplate that aims to automatically create a well-defined layout for a statistical analysis project. It provides conventions and utilities for automatically loading and munging data.

The project layout is larger than I would prefer, but provides insight into a highly-structured way for organizing your project.

cache: Preprocessed datasets that don’t need to be re-generated every time you perform an analysis.
config: Configuration settings for the project.
data: Raw data files.
munge: Preprocessing data munging code, the outputs of which are put in cache.
src: Statistical analysis scripts.
diagnostics: Scripts to diagnose data sets for corruption or outliers.
doc: Documentation written about the analysis.
graphs: Graphs created from analysis.
lib: Helper library functions but not the core statistical analysis.
logs: Output of scripts and any automatic logging.
profiling: Scripts to benchmark the timing of your code.
reports: Output reports and content that might go into reports such as tables.
tests: Unit tests and regression suite for your code.
README: Notes that orient any newcomers to the project.
TODO: List of future improvements and bug fixes you plan to make.
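
As a rough illustration, a layout like the one above can be scaffolded in a few lines of Python. This is a sketch only and not how ProjectTemplate itself works (ProjectTemplate is an R package). The function name create_layout and the choice to create README and TODO as empty files are assumptions made for the example.

```python
from pathlib import Path

# Directory names taken from the ProjectTemplate layout listed above.
DIRECTORIES = [
    "cache", "config", "data", "munge", "src", "diagnostics", "doc",
    "graphs", "lib", "logs", "profiling", "reports", "tests",
]
# Top-level files from the same list.
FILES = ["README", "TODO"]


def create_layout(root):
    """Create the directories and empty placeholder files under root.

    Hypothetical helper: anything that already exists is left untouched,
    so it is safe to run more than once.
    """
    root = Path(root)
    for name in DIRECTORIES:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root
```

Calling create_layout("my_analysis") would create the folder my_analysis with the thirteen directories and two files. If the full layout is larger than you need, trimming the DIRECTORIES list is the obvious adjustment.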

You can learn more on the ProjectTemplate homepage, the blog post on John’s website, the GitHub page for development, and the CRAN page for distribution.

Data Management

Software Carpentry provides a short presentation titled Data Management. The approach to data management is inspired by an article by William Stafford Noble titled A Quick Guide to Organizing Computational Biology Projects.

The presentation describes problems with maintaining multiple versions of data on disk or in version control. It comments that the main requirement is data archiving, and it proposes an approach that uses dated directory names together with metadata files for each data file, where the metadata files are themselves managed in version control. It’s an interesting approach.
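
One way to picture the dated-directory idea is the Python sketch below. It is not taken from the presentation: the function name archive_dataset, the ".meta.json" suffix, and the particular metadata fields are all assumptions chosen for illustration.

```python
import hashlib
import json
import shutil
from datetime import date
from pathlib import Path


def archive_dataset(source_file, archive_root, description, today=None):
    """Copy source_file into archive_root/YYYY-MM-DD/ and write a metadata file beside it.

    Returns the path of the metadata file. All names here are hypothetical.
    """
    source_file = Path(source_file)
    today = today or date.today()

    # One directory per archive date, e.g. archive/2015-11-30/
    target_dir = Path(archive_root) / today.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)

    target_file = target_dir / source_file.name
    shutil.copy2(source_file, target_file)

    # Small text file describing the data; this is the part intended
    # for version control.
    metadata = {
        "file": source_file.name,
        "archived_on": today.isoformat(),
        "description": description,
        "size_bytes": target_file.stat().st_size,
        "sha256": hashlib.sha256(target_file.read_bytes()).hexdigest(),
    }
    metadata_file = target_dir / (source_file.name + ".meta.json")
    metadata_file.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return metadata_file
```

The checksum lets you confirm later that the archived file has not changed, even if the data file itself is stored outside version control.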

You can review the video and slides for the presentation here.

Best Practices

There is a lot of discussion of best practices for project layout and code organization for data analysis projects on question and answer sites. Some popular examples include:

How Do You Manage Your Files & Directories For Your Projects?
Workflow for statistical analysis and report writing
Project Organization with R
What are efficient ways to organize R code and output?

A good example is the question How to efficiently manage a statistical analysis project? which was turned into a community wiki describing the best practices. In summary, these practices are divided into the following sections:

Data management: Use a directory structure, never modify raw data directly, check data consistency, use GNU make.
Coding: Organize code into functional units, document everything, custom functions in a dedicated file.
Analysis: Record your random seeds, separate parameters into config files, use multivariate plots.
Versioning: Use version control, backup everything, use an issue tracker.
Editing/Reporting: Combine code and reporting and use formal report generators.
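
Two of the analysis practices, keeping parameters in config files and recording random seeds, can be sketched in Python as follows. The function names, the JSON config format, and the "seed" key are assumptions for the example and do not come from the community wiki.

```python
import json
import random
from pathlib import Path


def load_config(config_path):
    """Read analysis parameters from a JSON file instead of hard-coding them."""
    return json.loads(Path(config_path).read_text())


def start_run(config, log_path):
    """Seed the random number generator and append the seed and parameters to a log.

    If the config has no "seed" entry, one is drawn and logged so the run
    can still be repeated later.
    """
    seed = config.get("seed")
    if seed is None:
        seed = random.SystemRandom().randrange(2**32)
    random.seed(seed)

    # One JSON record per line keeps the log easy to append to and to parse.
    record = {"seed": seed, "parameters": config}
    with open(log_path, "a") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")
    return seed
```

A script would call start_run(load_config("config/analysis.json"), "logs/runs.log") before any step that uses random numbers; both paths are hypothetical. Libraries with their own random number generators need to be seeded separately.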

More Practices

With each project, I try to refine my project layout. It’s hard because the projects vary in their data and aims, as do the languages and tools. I’ve tried versions built entirely from compiled code and versions built entirely in scripting languages. Some good tips I’ve discovered include:

Stick to a POSIX filesystem layout (var, etc, bin, lib, and so on).
Put all commands in scripts.
Call all scripts from GNU make targets.
Have make targets that create the environment and download public datasets.
Create recipes and let the infrastructure check and create any missing output products each run.

This last point is a game changer. It allows you to pipeline your workflow and define recipes with wild abandon for tasks like data analysis, preprocessing, model configuration, feature selection, etc. The framework knows how to execute recipes and creates results for you to review. I’ve talked about this approach before.
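
To make the recipe idea concrete, here is a minimal Python sketch of the rule that GNU make applies: rebuild an output only when it is missing or older than its inputs. It is not the framework described above, which is built on make. The Recipe class and run_recipes function are hypothetical names invented for the illustration.

```python
from pathlib import Path


class Recipe:
    """An output file, the input files it depends on, and a function that builds it."""

    def __init__(self, output, inputs, build):
        self.output = Path(output)
        self.inputs = [Path(p) for p in inputs]
        self.build = build  # callable taking (inputs, output)

    def is_stale(self):
        """True if the output is missing or older than any of its inputs."""
        if not self.output.exists():
            return True
        built_at = self.output.stat().st_mtime
        return any(p.stat().st_mtime > built_at for p in self.inputs)


def run_recipes(recipes):
    """Run stale recipes in the order given; return the outputs that were rebuilt.

    Assumes each recipe is listed after the recipes that produce its
    inputs. No dependency sorting is done in this sketch.
    """
    rebuilt = []
    for recipe in recipes:
        if recipe.is_stale():
            recipe.output.parent.mkdir(parents=True, exist_ok=True)
            recipe.build(recipe.inputs, recipe.output)
            rebuilt.append(recipe.output)
    return rebuilt
```

Running run_recipes a second time with nothing changed rebuilds nothing, which is what makes it cheap to keep adding recipes. In a real project, make gives you this behavior plus dependency ordering without extra code.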
