CRISP data mining process

CRISP-DM (the Cross-Industry Standard Process for Data Mining) is a process model with six phases that naturally describes the data science life cycle. It helps you plan, organize, and implement a data science (or machine learning) project.

I: The data mining process

  • Business understanding – What does the business need?
  • Data understanding – What data do we have / need? Is it clean?
  • Data preparation – How do we organize the data for modeling?
  • Modeling – What modeling techniques should we apply?
  • Evaluation – Which model best meets the business objectives?
  • Deployment – How do stakeholders access the results?

[Figure: The Most Common Methodology]

Data science teams that combine a loose implementation of CRISP-DM with an overarching, team-based agile project management approach will likely see the best results. Even teams that don’t explicitly follow CRISP-DM can still use the framework diagram to explain the differences between data science and software projects.

II: What are the six CRISP-DM phases?

2.1 Business Understanding

The Business Understanding phase focuses on understanding the objectives and requirements of the project. It has four tasks. Apart from the third, they are foundational project management activities common to most projects:

    1. Determine business objectives: You should first “thoroughly understand, from a business perspective, what the customer really wants to accomplish” (CRISP-DM Guide), and then define business success criteria.
    2. Assess situation: Determine resource availability and project requirements, assess risks and contingencies, and conduct a cost-benefit analysis.
    3. Determine data mining goals: In addition to defining the business objectives, you should also define what success looks like from a technical data mining perspective.
    4. Produce project plan: Select technologies and tools and define detailed plans for each project phase.

2.2 Data Understanding

Next is the Data Understanding phase. Building on the foundation of Business Understanding, it shifts the focus to identifying, collecting, and analyzing the data sets that can help you accomplish the project goals. This phase also has four tasks:

  • Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.

  • Describe data: Examine the data and document its surface properties like data format, number of records, or field identities.

  • Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.

  • Verify data quality: How clean/dirty is the data?
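
The describe, explore, and verify tasks above can be sketched with pandas. This is only a minimal illustration, not something prescribed by CRISP-DM: the small `customers` table and its column names are invented for the example, and a real project would load the data collected in the first task instead.

```python
import pandas as pd

# Hypothetical sample data; in a real project this would come from
# the "collect initial data" task (for example a file or a database query).
customers = pd.DataFrame({
    "age": [34, 45, None, 29, 45],
    "income": [52000, 61000, 58000, None, 61000],
    "churned": [0, 1, 0, 0, 1],
})

# Describe data: surface properties such as number of records and field types.
n_records, n_fields = customers.shape
field_types = customers.dtypes

# Explore data: summary statistics and pairwise relationships.
summary = customers.describe()
correlations = customers.corr()

# Verify data quality: count missing values per field and duplicated rows.
missing_per_field = customers.isna().sum()
n_duplicates = int(customers.duplicated().sum())

print(f"{n_records} records, {n_fields} fields")
print(missing_per_field)
print(f"{n_duplicates} duplicated rows")
```

Counts of missing values and duplicates like these are one simple way to put a number on "how clean or dirty" the data is before moving on to preparation.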

2.3 Data Preparation

This phase, which is often referred to as “data munging”, prepares the final data set(s) for modeling. It has five tasks:

  • Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.

  • Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage-in, garbage-out. A common practice during this task is to correct, impute, or remove erroneous values.

  • Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index from height and weight fields.

  • Integrate data: Create new data sets by combining data from multiple sources.

  • Format data: Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
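
As a rough sketch of the tasks above, the following pandas snippet formats, cleans, constructs, and integrates a toy data set. The tables, column names, and the choice of median imputation are assumptions made for illustration. The steps run in a different order from the list because the string-to-number conversion has to happen before the body mass index can be computed.

```python
import pandas as pd

# Hypothetical raw data; names and values are invented for illustration.
people = pd.DataFrame({
    "person_id": [1, 2, 3],
    "height_m": ["1.80", "1.65", "1.75"],  # numbers stored as strings
    "weight_kg": [81.0, None, 70.0],       # one missing value
})
visits = pd.DataFrame({
    "person_id": [1, 2, 3],
    "n_visits": [4, 1, 7],
})

# Format data: convert string values that store numbers to numeric values.
people["height_m"] = pd.to_numeric(people["height_m"])

# Clean data: impute the missing weight with the median
# (correcting or removing the record would be alternatives).
people["weight_kg"] = people["weight_kg"].fillna(people["weight_kg"].median())

# Construct data: derive body mass index = weight (kg) / height (m) squared.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

# Integrate data: combine the two sources on their shared key.
prepared = people.merge(visits, on="person_id", how="left")
print(prepared)
```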

2.4 Modeling

Typical supervised and unsupervised tasks include classification and probability estimation, regression, similarity matching, clustering, co-occurrence grouping, profiling, link prediction, data reduction, and causal modelling.

Here you’ll likely build and assess various models based on several different modeling techniques. This phase has four tasks:

  • Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).
  • Generate test design: Depending on your modeling approach, you might need to split the data into training, test, and validation sets.
  • Build model: As glamorous as this might sound, this might just be executing a few lines of code like “reg = LinearRegression().fit(X, y)”.
  • Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.
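
The four tasks above might look like this with scikit-learn. This is a minimal sketch under stated assumptions: the data is synthetic, the two candidate algorithms and the mean absolute error metric are arbitrary choices for the example, and only a train/test split is shown (a separate validation set would be added when tuning hyperparameters).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): y is linear in X plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

# Generate test design: hold out 25% of the records for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Select modeling techniques: two candidate algorithms to compare.
candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Build each model, then assess it on the held-out test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in scores.items():
    print(f"{name}: test MAE = {mae:.2f}")
```

The scores are only the technical half of the assessment. Interpreting them against domain knowledge and the pre-defined success criteria remains a human task.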

2.5 Evaluation

Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation phase looks more broadly at which model best meets the business needs and what to do next. This phase has three tasks:

  • Evaluate results: Do the models meet the business success criteria? Which one(s) should we approve for the business?
  • Review process: Review the work accomplished. Was anything overlooked? Were all steps properly executed? Summarize findings and correct anything if needed.
  • Determine next steps: Based on the previous two tasks, determine whether to proceed to deployment, iterate further, or initiate new projects.
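
One way to make the "evaluate results" and "determine next steps" tasks concrete is to write the business success criteria down as explicit thresholds and check each candidate model against them. The sketch below is purely illustrative: the model names, metric values, and thresholds are all hypothetical, and real criteria would come from the Business Understanding phase.

```python
# Hypothetical results for three candidate models (all figures invented).
candidates = [
    {"name": "model_a", "recall": 0.82, "monthly_cost": 1200.0},
    {"name": "model_b", "recall": 0.74, "monthly_cost": 400.0},
    {"name": "model_c", "recall": 0.88, "monthly_cost": 3000.0},
]

# Hypothetical business success criteria agreed on at the start of the project.
MIN_RECALL = 0.80
MAX_MONTHLY_COST = 2000.0


def meets_business_criteria(candidate: dict) -> bool:
    """Return True if the model satisfies every business success criterion."""
    return (
        candidate["recall"] >= MIN_RECALL
        and candidate["monthly_cost"] <= MAX_MONTHLY_COST
    )


# Evaluate results: which models can be approved for the business?
approved = [c["name"] for c in candidates if meets_business_criteria(c)]

# Determine next steps: deploy if something was approved, otherwise iterate.
next_step = "proceed to deployment" if approved else "iterate further"

print(f"approved: {approved}, next step: {next_step}")
```

Note that the model with the best technical metric (`model_c` here) is not necessarily the one that meets the business criteria.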

2.6 Deployment

A model is not particularly useful unless the customer can access its results. The complexity of this phase varies widely. This final phase has four tasks:

  • Plan deployment: Develop and document a plan for deploying the model.
  • Plan monitoring and maintenance: Develop a thorough monitoring and maintenance plan to avoid issues during the operational phase (or post-project phase) of a model.
  • Produce final report: The project team documents a summary of the project, which might include a final presentation of data mining results.
  • Review project: Conduct a project retrospective about what went well, what could have been better, and how to improve in the future.
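
Deployment can be as simple as saving a fitted model so that another application can load it and request predictions. The following sketch shows that round trip with Python's `pickle` module and scikit-learn. It is one possible approach among many, and the toy data and the file name `model.pkl` are assumptions for the example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a toy model on synthetic data (for illustration only): y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
reg = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"  # hypothetical artifact name

    # Deploy: serialize the fitted model to a file.
    model_path.write_bytes(pickle.dumps(reg))

    # A consuming application loads the artifact and asks for a prediction.
    served = pickle.loads(model_path.read_bytes())
    prediction = float(served.predict(np.array([[4.0]]))[0])

print(f"prediction for x=4: {prediction:.2f}")
```

Pickle files should only be loaded from trusted sources, and the loading environment should use the same library versions as the one that saved the model.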
