zengzhaozheng

A Co-occurrence Matrix Implementation of the ItemBase Recommendation Algorithm on MapReduce (Part 1)

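I. Overview

For the past two months I have been studying how to evaluate the similarity between users from their tags, which involves some knowledge of recommendation algorithms. During this time I read through 《推荐引擎实践》 and 《Mahout in Action》, and here I implement a distributed ItemBase recommendation algorithm based on ideas from those two books and my own understanding. The work has two parts: the first derives a user's recommendations in a simple way from a co-occurrence matrix, and the second implements the ItemBase algorithm with the traditional similarity-matrix approach. This post covers the first part and implements it with MapReduce; the next post will cover the second part and its implementation.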

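II. Algorithm principles

Collaborative filtering is one of many recommendation algorithms and is already widely used. It comes in two main kinds: user-based collaborative filtering and item-based collaborative filtering.

Put plainly, the ItemBase algorithm works like this: user A likes item X1 and user B likes item X2; if X1 and X2 are similar, the items A has liked are recommended to B, and the items B has liked are recommended to A. This algorithm depends entirely on the items users have liked in the past. The UserBase algorithm, put just as plainly: user A likes item X1 and user B likes item X2; if users A and B are similar, X1 is recommended to B and X2 is recommended to A. [Figure: simple schematic of the two approaches]

Which one to choose depends on your situation. If there are far more users than kinds of items, use ItemBase collaborative filtering; if there are far fewer users than kinds of items, use UserBase collaborative filtering. One reason for choosing this way is to keep the item similarity matrix, the user similarity matrix, or the co-occurrence matrix as small as possible.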

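III. Data modeling

The basic algorithms were outlined above. For any algorithm, modeling the data so that the algorithm can be applied to it is both the key point and the hard part. This section draws on my own project experience and on some ideas from 《推荐引擎实践》. It has two parts: recommending from a co-occurrence matrix, and recommending from a similarity measure.

(1) Co-occurrence matrix approach:

Step 1: convert the data into user vectors

1[102:0.1,103:0.2,104:0.3]: the items user 1 likes, with the corresponding preference scores.

2[101:0.1,102:0.7,105:0.9]: the items user 2 likes, with the corresponding preference scores.

3[102:0.1,103:0.7,104:0.2]: the items user 3 likes, with the corresponding preference scores.

Step 2: compute the co-occurrence matrix

Simply put, entry (x1, x2) of the matrix is the number of users who like both item x1 and item x2.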
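As a minimal sketch of this counting step outside MapReduce, the plain-Java snippet below builds the co-occurrence counts from the three user vectors listed in step 1. The class and method names (`CooccurrenceSketch`, `cooccurrence`, `sampleUsers`) are illustrative and not part of the article's code. Preference scores are ignored, because only "likes both" matters for the counts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class CooccurrenceSketch {

    /** For every ordered item pair (a, b), counts the users who like both a and b. */
    static Map<Integer, Map<Integer, Integer>> cooccurrence(Map<Integer, int[]> userItems) {
        Map<Integer, Map<Integer, Integer>> matrix = new TreeMap<>();
        for (int[] items : userItems.values()) {
            for (int a : items) {
                Map<Integer, Integer> row = matrix.computeIfAbsent(a, k -> new TreeMap<>());
                for (int b : items) {
                    // a == b is counted too, so the diagonal holds the number of users who like a
                    row.merge(b, 1, Integer::sum);
                }
            }
        }
        return matrix;
    }

    /** The three sample users from step 1 (userID -> liked itemIDs). */
    static Map<Integer, int[]> sampleUsers() {
        Map<Integer, int[]> users = new LinkedHashMap<>();
        users.put(1, new int[] {102, 103, 104});
        users.put(2, new int[] {101, 102, 105});
        users.put(3, new int[] {102, 103, 104});
        return users;
    }

    public static void main(String[] args) {
        cooccurrence(sampleUsers())
                .forEach((item, row) -> System.out.println(item + " " + row));
    }
}
```

For these three users the (102, 102) entry is 3 and the (102, 103) entry is 2, since every user likes 102 while only users 1 and 3 like both 102 and 103.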

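Step 3: generate the user-item rating matrix

Step 4: multiply the item co-occurrence matrix by the user-item rating matrix to get the recommendations

Example: computing the recommendation list for user 1. Each sum multiplies one row of the co-occurrence matrix by user 1's rating vector [0, 0.1, 0.2, 0.3, 0] over items 101 to 105.

Total score of user 1 for item 101:

1*0+1*0.1+0*0.2+0*0.3+1*0=0.1

Total score of user 1 for item 102:

1*0+3*0.1+1*0.2+2*0.3+2*0=1.1

Total score of user 1 for item 103:

0*0+1*0.1+1*0.2+1*0.3+0*0=0.6

Total score of user 1 for item 104:

0*0+2*0.1+1*0.2+2*0.3+1*0=1.0

Total score of user 1 for item 105:

1*0+2*0.1+0*0.2+1*0.3+2*0=0.5

This gives user 1 the recommendation list 1[101:0.1,102:1.1,103:0.6,104:1.0,105:0.5], which after sorting becomes the final recommendation list 1[102:1.1,104:1.0,103:0.6,105:0.5,101:0.1].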
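The arithmetic above is a matrix-vector product followed by a sort. The sketch below redoes it in plain Java. The co-occurrence rows are an assumption read off the coefficients of the five sums above, and the names `CooccurrenceScoring`, `scores`, and `ranked` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CooccurrenceScoring {

    static final int[] ITEMS = {101, 102, 103, 104, 105};

    // Assumption: rows taken from the coefficients in the worked sums for items 101..105.
    static final double[][] COOC = {
        {1, 1, 0, 0, 1},
        {1, 3, 1, 2, 2},
        {0, 1, 1, 1, 0},
        {0, 2, 1, 2, 1},
        {1, 2, 0, 1, 2},
    };

    // User 1's ratings over items 101..105 (0 where the user has no rating).
    static final double[] USER1 = {0.0, 0.1, 0.2, 0.3, 0.0};

    /** Multiplies the matrix by the rating vector: one total score per item. */
    static double[] scores(double[][] matrix, double[] ratings) {
        double[] out = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < ratings.length; j++) {
                out[i] += matrix[i][j] * ratings[j];
            }
        }
        return out;
    }

    /** Returns the item IDs ordered from highest to lowest score. */
    static List<Integer> ranked(int[] items, double[] scores) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < items.length; i++) {
            order.add(i);
        }
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));
        List<Integer> result = new ArrayList<>();
        for (int i : order) {
            result.add(items[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] s = scores(COOC, USER1);
        for (int item : ranked(ITEMS, s)) {
            System.out.println(item);
        }
    }
}
```

Up to floating-point rounding, the scores are 0.1, 1.1, 0.6, 1.0 and 0.5, and the ranking is 102, 104, 103, 105, 101, matching the list above.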

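(2) Computing the user's recommendation vector from item similarity.

Computing the recommendation vector from item similarity is much the same as the co-occurrence approach above: the item similarity matrix replaces the co-occurrence matrix, is multiplied by the user-item rating matrix, and the recommendation vector is then computed from the result.

Computing the similarity matrix:

Before computing it, let us first look at the ways to measure item similarity.

There are many algorithms for computing item similarity, and the choice depends on your data model: Pearson correlation, Euclidean distance (really just the distance between two points), cosine similarity, Spearman correlation, the Tanimoto coefficient, and the log-likelihood ratio. The Tanimoto coefficient and the log-likelihood ratio are mainly for data models that carry no explicit preference value for an item, which Mahout calls the Boolean data model. Below I mainly introduce two: Euclidean distance and cosine similarity.

The key is how to turn the existing data into a vector space model that these measures can be applied to. Taking Euclidean distance as an example, here is how to build the model: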

Step 1: convert the user vectors into item vectors

User vectors:

1[102:0.1,103:0.2,104:0.3]

2[101:0.1,102:0.7,105:0.9]

3[102:0.1,103:0.7,104:0.2]

Converted into item vectors:

101[2:0.1]

102[1:0.1,2:0.7,3:0.1]

103[1:0.2,3:0.7]

104[1:0.3,3:0.2]

105[2:0.9]
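The conversion above is a transposition of a sparse matrix: each (user, item, score) entry is re-keyed by item. A minimal plain-Java sketch, with hypothetical names (`TransposeSketch`, `toItemVectors`, `vector`) that do not appear in the article's code:

```java
import java.util.Map;
import java.util.TreeMap;

class TransposeSketch {

    /** Builds a sparse vector from alternating (index, value) pairs. */
    static Map<Integer, Double> vector(double... pairs) {
        Map<Integer, Double> v = new TreeMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            v.put((int) pairs[i], pairs[i + 1]);
        }
        return v;
    }

    /** Re-keys userID -> (itemID -> score) as itemID -> (userID -> score). */
    static Map<Integer, Map<Integer, Double>> toItemVectors(
            Map<Integer, Map<Integer, Double>> userVectors) {
        Map<Integer, Map<Integer, Double>> itemVectors = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> user : userVectors.entrySet()) {
            for (Map.Entry<Integer, Double> pref : user.getValue().entrySet()) {
                itemVectors
                        .computeIfAbsent(pref.getKey(), k -> new TreeMap<>())
                        .put(user.getKey(), pref.getValue());
            }
        }
        return itemVectors;
    }

    /** The three user vectors listed above. */
    static Map<Integer, Map<Integer, Double>> sampleUserVectors() {
        Map<Integer, Map<Integer, Double>> users = new TreeMap<>();
        users.put(1, vector(102, 0.1, 103, 0.2, 104, 0.3));
        users.put(2, vector(101, 0.1, 102, 0.7, 105, 0.9));
        users.put(3, vector(102, 0.1, 103, 0.7, 104, 0.2));
        return users;
    }

    public static void main(String[] args) {
        toItemVectors(sampleUserVectors())
                .forEach((item, vec) -> System.out.println(item + " " + vec));
    }
}
```

Run on the three user vectors, this yields the same five item vectors as listed above, for example users 1, 2 and 3 with scores 0.1, 0.7 and 0.1 for item 102.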

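Step 2:

The similarity between two items is then computed as: [figure: similarity formula]

Step 3:

The resulting item similarity matrix is (the meaningless self-similarities are omitted): [figure: item similarity matrix]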
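The similarity formula and matrix were figures in the original post. As a hedged sketch, the snippet below computes a Euclidean-distance-based similarity, sim = 1 / (1 + d), treating a missing rating as 0, and adds cosine similarity for comparison. That formula is an assumption, although it does reproduce values used in step 4, such as 0.6186429 for items 101 and 102 and 0.662294 for items 103 and 104. All class and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class ItemSimilaritySketch {

    /** Builds a sparse item vector from alternating (userID, rating) pairs. */
    static Map<Integer, Double> vector(double... userRatingPairs) {
        Map<Integer, Double> v = new TreeMap<>();
        for (int i = 0; i + 1 < userRatingPairs.length; i += 2) {
            v.put((int) userRatingPairs[i], userRatingPairs[i + 1]);
        }
        return v;
    }

    /** 1 / (1 + Euclidean distance); a missing rating counts as 0 (assumption). */
    static double euclideanSimilarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        Set<Integer> users = new HashSet<>(a.keySet());
        users.addAll(b.keySet());
        double sumSq = 0.0;
        for (int u : users) {
            double diff = a.getOrDefault(u, 0.0) - b.getOrDefault(u, 0.0);
            sumSq += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    /** Cosine of the angle between the two vectors; 0 if either vector is all zeros. */
    static double cosineSimilarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<Integer, Double> item101 = vector(2, 0.1);
        Map<Integer, Double> item102 = vector(1, 0.1, 2, 0.7, 3, 0.1);
        Map<Integer, Double> item103 = vector(1, 0.2, 3, 0.7);
        Map<Integer, Double> item104 = vector(1, 0.3, 3, 0.2);
        System.out.println("euclidean(101,102) = " + euclideanSimilarity(item101, item102));
        System.out.println("euclidean(103,104) = " + euclideanSimilarity(item103, item104));
        System.out.println("cosine(103,104)    = " + cosineSimilarity(item103, item104));
    }
}
```

Treating missing ratings as 0 is only one convention. Restricting the distance to users who rated both items is another, and it would give different numbers.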

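Step 4: multiply the item similarity matrix by the user-item rating matrix to get the recommendations:

Example: computing user 1's recommendation scores in a similar way:

Total score of user 1 for item 101:

1*0+1*0.6186429+0*0.6964322+0*0.7277142+1*0.55555556=1.174198

Total score of user 1 for item 102:

1*0.6186429+3*0+1*0.5188439+2*0.5764197+2*0.8032458=3.896818

Total score of user 1 for item 103:

0*0.6964322+1*0.5188439+1*0+1*0.662294+0*0.463481=1.181138

Total score of user 1 for item 104:

0*0.7277142+2*0.5764197+1*0.662294+2*0+1*0.5077338=2.322867

Total score of user 1 for item 105:

1*0.55555556+2*0.8032458+0*0.463481+1*0.5077338+2*0=2.669780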

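IV. MapReduce implementation of the co-occurrence matrix approach

Here I implement the co-occurrence-matrix recommendation method with MapReduce and some of Mahout's data types; recommendation via the similarity matrix is left for the next post. The Boolean data model is used, meaning users give no initial score to the items they like, so the program uses a default score of 1 for every liked item.

First, the data flow of the whole MapReduce pipeline: [figure: data flow diagram of the jobs]

Implementation, starting with the utility class: HadoopUtil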

package com.util;

import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class HadoopUtil {

  private static final Logger log = LoggerFactory.getLogger(HadoopUtil.class);

  private HadoopUtil() { }

  public static Job prepareJob(String jobName,
		                   String[] inputPath,
		                   String outputPath,
                           Class<? extends InputFormat> inputFormat,
                           Class<? extends Mapper> mapper,
                           Class<? extends Writable> mapperKey,
                           Class<? extends Writable> mapperValue,
                           Class<? extends OutputFormat> outputFormat, Configuration conf) throws IOException {

    Job job = new Job(new Configuration(conf)); 
    job.setJobName(jobName);
    Configuration jobConf = job.getConfiguration();

    if (mapper.equals(Mapper.class)) {
      throw new IllegalStateException("Can't figure out the user class jar file from mapper/reducer");
    }
    job.setJarByClass(mapper);

    job.setInputFormatClass(inputFormat);
    StringBuilder inputPathsStringBuilder =new StringBuilder();
    for(String p : inputPath){
    	inputPathsStringBuilder.append(",").append(p);
    }
    inputPathsStringBuilder.deleteCharAt(0);
    jobConf.set("mapred.input.dir", inputPathsStringBuilder.toString());

    job.setMapperClass(mapper);
    job.setMapOutputKeyClass(mapperKey);
    job.setMapOutputValueClass(mapperValue);
    job.setOutputKeyClass(mapperKey);
    job.setOutputValueClass(mapperValue);
    jobConf.setBoolean("mapred.compress.map.output", true);
    job.setNumReduceTasks(0);

    job.setOutputFormatClass(outputFormat);
    jobConf.set("mapred.output.dir", outputPath);

    return job;
  }

  public static Job prepareJob(String jobName,
		  				   String[] inputPath,
		                   String outputPath,
                           Class<? extends InputFormat> inputFormat,
                           Class<? extends Mapper> mapper,
                           Class<? extends Writable> mapperKey,
                           Class<? extends Writable> mapperValue, 
                           Class<? extends Reducer> reducer,
                           Class<? extends Writable> reducerKey,
                           Class<? extends Writable> reducerValue,
                           Class<? extends OutputFormat> outputFormat,
                           Configuration conf) throws IOException {

    Job job = new Job(new Configuration(conf));
    job.setJobName(jobName);
    Configuration jobConf = job.getConfiguration();

    if (reducer.equals(Reducer.class)) {
      if (mapper.equals(Mapper.class)) {
        throw new IllegalStateException("Can't figure out the user class jar file from mapper/reducer");
      }
      job.setJarByClass(mapper);
    } else {
      job.setJarByClass(reducer);
    }

    job.setInputFormatClass(inputFormat);
    StringBuilder inputPathsStringBuilder =new StringBuilder();
    for(String p : inputPath){
    	inputPathsStringBuilder.append(",").append(p);
    }
    inputPathsStringBuilder.deleteCharAt(0);
    jobConf.set("mapred.input.dir", inputPathsStringBuilder.toString());

    job.setMapperClass(mapper);
    if (mapperKey != null) {
      job.setMapOutputKeyClass(mapperKey);
    }
    if (mapperValue != null) {
      job.setMapOutputValueClass(mapperValue);
    }

    jobConf.setBoolean("mapred.compress.map.output", true);

    job.setReducerClass(reducer);
    job.setOutputKeyClass(reducerKey);
    job.setOutputValueClass(reducerValue);

    job.setOutputFormatClass(outputFormat);
    jobConf.set("mapred.output.dir", outputPath);

    return job;
  }
  
  
  
  
	public static Job prepareJob(String jobName, String[] inputPath,
			String outputPath, Class<? extends InputFormat> inputFormat,
			Class<? extends Mapper> mapper,
			Class<? extends Writable> mapperKey,
			Class<? extends Writable> mapperValue,
			Class<? extends Reducer> combiner,
			Class<? extends Reducer> reducer,
			Class<? extends Writable> reducerKey,
			Class<? extends Writable> reducerValue,
			Class<? extends OutputFormat> outputFormat, Configuration conf)
			throws IOException {

		Job job = new Job(new Configuration(conf));
		job.setJobName(jobName);
		Configuration jobConf = job.getConfiguration();

		if (reducer.equals(Reducer.class)) {
			if (mapper.equals(Mapper.class)) {
				throw new IllegalStateException(
						"Can't figure out the user class jar file from mapper/reducer");
			}
			job.setJarByClass(mapper);
		} else {
			job.setJarByClass(reducer);
		}

		job.setInputFormatClass(inputFormat);
		StringBuilder inputPathsStringBuilder = new StringBuilder();
		for (String p : inputPath) {
			inputPathsStringBuilder.append(",").append(p);
		}
		inputPathsStringBuilder.deleteCharAt(0);
		jobConf.set("mapred.input.dir", inputPathsStringBuilder.toString());

		job.setMapperClass(mapper);
		if (mapperKey != null) {
			job.setMapOutputKeyClass(mapperKey);
		}
		if (mapperValue != null) {
			job.setMapOutputValueClass(mapperValue);
		}

		jobConf.setBoolean("mapred.compress.map.output", true);

		job.setCombinerClass(combiner);
		
		job.setReducerClass(reducer);
		job.setOutputKeyClass(reducerKey);
		job.setOutputValueClass(reducerValue);

		job.setOutputFormatClass(outputFormat);
		jobConf.set("mapred.output.dir", outputPath);

		return job;
	}

  public static String getCustomJobName(String className, JobContext job,
                                  Class<? extends Mapper> mapper,
                                  Class<? extends Reducer> reducer) {
    StringBuilder name = new StringBuilder(100);
    String customJobName = job.getJobName();
    if (customJobName == null || customJobName.trim().isEmpty()) {
      name.append(className);
    } else {
      name.append(customJobName);
    }
    name.append('-').append(mapper.getSimpleName());
    name.append('-').append(reducer.getSimpleName());
    return name.toString();
  }


  public static void delete(Configuration conf, Iterable<Path> paths) throws IOException {
    if (conf == null) {
      conf = new Configuration();
    }
    for (Path path : paths) {
      FileSystem fs = path.getFileSystem(conf);
      if (fs.exists(path)) {
        log.info("Deleting {}", path);
        fs.delete(path, true);
      }
    }
  }

  public static void delete(Configuration conf, Path... paths) throws IOException {
    delete(conf, Arrays.asList(paths));
  }

  public static long countRecords(Path path, Configuration conf) throws IOException {
    long count = 0;
    Iterator<?> iterator = new SequenceFileValueIterator<Writable>(path, true, conf);
    while (iterator.hasNext()) {
      iterator.next();
      count++;
    }
    return count;
  }

  public static long countRecords(Path path, PathType pt, PathFilter filter, Configuration conf) throws IOException {
    long count = 0;
    Iterator<?> iterator = new SequenceFileDirValueIterator<Writable>(path, pt, filter, null, true, conf);
    while (iterator.hasNext()) {
      iterator.next();
      count++;
    }
    return count;
  }
}

The HadoopUtil utility class above is used to configure every job below.

Step 1: process the raw input data

Mapper of the SourceDataToItemPrefsJob job, which processes the raw data: SourceDataToItemPrefsMapper

package com.mapper;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VarLongWritable;


/**
 * Mapper input format:  userID:itemID1 itemID2 itemID3....
 * Mapper output format: <userID,itemID>
 * @author 曾昭正
 */
public class SourceDataToItemPrefsMapper extends Mapper<LongWritable, Text, VarLongWritable, VarLongWritable>{
	//private static final Logger logger = LoggerFactory.getLogger(SourceDataToItemPrefsMapper.class);
	private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
	
	@Override
	protected void map(LongWritable key, Text value,Context context)
			throws IOException, InterruptedException {
		 String line = value.toString();
		// logger.info("line:"+line);
		 Matcher matcher = NUMBERS.matcher(line);
		 if(!matcher.find()) return;// the first number is the userID; skip lines that contain no number at all
		 VarLongWritable userID = new VarLongWritable(Long.parseLong(matcher.group()));// VarLongWritable is a Writable type provided by Mahout
		 VarLongWritable itemID = new VarLongWritable();
		 while(matcher.find()){// every following number is an itemID
			 itemID.set(Long.parseLong(matcher.group()));
		//	 logger.info(userID + " " + itemID);
			 context.write(userID, itemID);
		 }
	}
}
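To see what the mapper's regular expression extracts without running Hadoop, here is a small stand-alone sketch. The class `LineParsingSketch` and its `parse` method are hypothetical helpers, and the input line is a made-up example in the `userID:itemID1 itemID2 ...` format described in the Javadoc above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineParsingSketch {

    private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

    /** Returns every number in the line: the first is the userID, the rest are itemIDs. */
    static List<Long> parse(String line) {
        List<Long> ids = new ArrayList<>();
        Matcher matcher = NUMBERS.matcher(line);
        while (matcher.find()) {
            ids.add(Long.parseLong(matcher.group()));
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Long> ids = parse("1:102 103 104");
        if (ids.isEmpty()) {
            return; // no userID on this line, nothing to emit
        }
        long userID = ids.get(0);
        for (long itemID : ids.subList(1, ids.size())) {
            // corresponds to context.write(userID, itemID) in the mapper
            System.out.println(userID + " " + itemID);
        }
    }
}
```

Because the pattern matches any run of digits, the separator between the userID and the itemIDs does not matter, so a colon, spaces or commas all work.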

Reducer of the SourceDataToItemPrefsJob job, which processes the raw data: SourceDataToUserVectorReducer

package com.reducer;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Reducer input:  <userID,Iterable<itemID>>
 * Reducer output: <userID,VectorWritable<index=itemID,value=pref>....>
 * @author 曾昭正
 */
public class SourceDataToUserVectorReducer extends Reducer<VarLongWritable, VarLongWritable, VarLongWritable, VectorWritable>{
	private static final Logger logger = LoggerFactory.getLogger(SourceDataToUserVectorReducer.class);
	@Override
	protected void reduce(VarLongWritable userID, Iterable<VarLongWritable> itemPrefs,Context context)
			throws IOException, InterruptedException {
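		/**
		 * DenseVector is backed by an array of floating-point values and stores every entry of the vector, so it suits dense vectors.
		 * RandomAccessSparseVector is backed by a hash map from int index to double value; it stores only the non-empty entries and supports random access.
		 * SequentialAccessSparseVector is backed by parallel arrays of int indices and double values; it also stores only the non-empty entries, but supports sequential access only.
		 * Choose the implementation that fits the algorithm: DenseVector or RandomAccessSparseVector when there is a lot of random access, SequentialAccessSparseVector when access is mostly sequential.
		 * With the vector implementations introduced, the remaining question is how to model existing data as vectors ("vectorizing" the data) so that Mahout's algorithms can be applied to it.
		 */
		Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
		for(VarLongWritable itemPref : itemPrefs){
			userVector.set((int)itemPref.get(), 1.0f);// RandomAccessSparseVector.set(index,value); preferences are Boolean, so every preference value defaults to 1.0f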
		}
		logger.info(userID+" "+new VectorWritable(userVector));
		context.write(userID, new VectorWritable(userVector));
	}
}

Step 2: build the co-occurrence matrix from the reduce output of the SourceDataToItemPrefsJob job

Mapper of the UserVectorToCooccurrenceJob job: UserVectorToCooccurrenceMapper

package com.mapper;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Mapper input:  <userID,VectorWritable<index=itemID,value=pref>....>
 * Mapper output: <itemID,itemID> (pairs of co-occurring item IDs)
 * @author 曾昭正
 */
public class UserVectorToCooccurrenceMapper extends Mapper<VarLongWritable, VectorWritable, IntWritable, IntWritable>{
	@Override
	protected void map(VarLongWritable userID, VectorWritable userVector,Context context)
			throws IOException, InterruptedException {
		Iterator<Vector.Element> it = userVector.get().nonZeroes().iterator();// iterate over the non-zero elements only
		while(it.hasNext()){
			int index1 = it.next().index();
			Iterator<Vector.Element> it2 = userVector.get().nonZeroes().iterator();
			while(it2.hasNext()){
				int index2  = it2.next().index();
				context.write(new IntWritable(index1), new IntWritable(index2));
			}
		}
		
	}
}

Reducer of the UserVectorToCooccurrenceJob job: UserVectorToCoocurrenceReducer

package com.reducer;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.cf.taste.hadoop.item.VectorOrPrefWritable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * Reducer input:  <itemID,Iterable<itemIDs>>
 * Reducer output: <mainItemID,Vector<coocItemID,coocTime (co-occurrence count)>....>
 * @author 曾昭正
 */
public class UserVectorToCoocurrenceReducer extends Reducer<IntWritable, IntWritable, IntWritable, VectorOrPrefWritable>{
	private static final Logger logger = LoggerFactory.getLogger(UserVectorToCoocurrenceReducer.class);
	@Override
	protected void reduce(IntWritable mainItemID, Iterable<IntWritable> coocItemIDs,Context context)
			throws IOException, InterruptedException {
		Vector coocItemIDVectorRow = new RandomAccessSparseVector(Integer.MAX_VALUE,100);
		for(IntWritable coocItem : coocItemIDs){
			int coocItemID = coocItem.get();// ID of an item that co-occurs with mainItemID
			coocItemIDVectorRow.set(coocItemID,coocItemIDVectorRow.get(coocItemID)+1.0);// accumulate the co-occurrence count
		}
		logger.info(mainItemID +" "+new VectorOrPrefWritable(coocItemIDVectorRow));
		context.write(mainItemID, new VectorOrPrefWritable(coocItemIDVectorRow));// emit the complete co-occurrence row of mainItemID
	}
}

Step 3: split the reduce output of the SourceDataToItemPrefsJob job

Mapper of the userVecotrSplitJob job: UserVecotrSplitMapper

package com.mapper;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.cf.taste.hadoop.item.VectorOrPrefWritable;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;


/**
 * Splits the user vectors so that they can be merged with the item co-occurrence vectors.
 * Mapper input:  <userID,Vector<itemIDIndex,preferenceValue>....>
 * Mapper output: <itemID,Vector<userID,preferenceValue>....>
 * @author 曾昭正
 */
public class UserVecotrSplitMapper extends Mapper<VarLongWritable, VectorWritable, IntWritable, VectorOrPrefWritable>{
	private static final Logger logger = LoggerFactory.getLogger(UserVecotrSplitMapper.class);
	@Override
	protected void map(VarLongWritable userIDWritable, VectorWritable value,Context context)
			throws IOException, InterruptedException {
		IntWritable itemIDIndex = new IntWritable();
		long userID = userIDWritable.get();
		Vector userVector = value.get();
		Iterator<Element> it = userVector.nonZeroes().iterator();// take only the non-zero entries of the user vector
		while(it.hasNext()){
			Element e = it.next();
			int itemID = e.index();
			itemIDIndex.set(itemID);
			float preferenceValuce = (float) e.get();
			logger.info(itemIDIndex +" "+new VectorOrPrefWritable(userID,preferenceValuce));
			context.write(itemIDIndex, new VectorOrPrefWritable(userID,preferenceValuce));
		}
		
	}
}

Step 4: merge the outputs of the userVecotrSplitJob and UserVectorToCooccurrenceJob jobs

Mapper of the combineUserVectorAndCoocMatrixJob job: CombineUserVectorAndCoocMatrixMapper

package com.mapper;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.cf.taste.hadoop.item.VectorOrPrefWritable;

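/**
 * Combines the co-occurrence matrix with the split user vectors so that the partial recommendation vectors can be computed.
 * This mapper has no real logic of its own; it passes every record through unchanged.
 * Note: the input of this mapper is the two output directories of the co-occurrence job and the user-vector split job.
 * Mapper input:  <itemID,Vector<userID,preferenceValue>> or <itemID,Vector<coocItemID,coocTimes>>
 * Mapper output: <itemID,Vector<userID,preferenceValue>/Vector<coocItemID,coocTimes>>
 * @author 曾昭正
 */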
public class CombineUserVectorAndCoocMatrixMapper extends Mapper<IntWritable, VectorOrPrefWritable, IntWritable, VectorOrPrefWritable>{
	@Override
	protected void map(IntWritable itemID, VectorOrPrefWritable value,Context context)
			throws IOException, InterruptedException {
		context.write(itemID, value);
	}

}

Reducer of the combineUserVectorAndCoocMatrixJob job: CombineUserVectorAndCoocMatrixReducer

package com.reducer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.cf.taste.hadoop.item.VectorAndPrefsWritable;
import org.apache.mahout.cf.taste.hadoop.item.VectorOrPrefWritable;
import org.apache.mahout.math.Vector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Combines the co-occurrence matrix with the split user vectors so that the partial recommendation vectors can be computed.
 * @author 曾昭正
 */
public class CombineUserVectorAndCoocMatrixReducer extends Reducer<IntWritable, VectorOrPrefWritable, IntWritable, VectorAndPrefsWritable>{
	private static final Logger logger = LoggerFactory.getLogger(CombineUserVectorAndCoocMatrixReducer.class);
	@Override
	protected void reduce(IntWritable itemID, Iterable<VectorOrPrefWritable> values,Context context)
			throws IOException, InterruptedException {
		VectorAndPrefsWritable vectorAndPrefsWritable = new VectorAndPrefsWritable();
		List<Long> userIDs = new ArrayList<Long>();
		List<Float> preferenceValues = new ArrayList<Float>();
		Vector coocVector = null;
		Vector coocVectorTemp = null;
		Iterator<VectorOrPrefWritable> it = values.iterator();
		while(it.hasNext()){
			VectorOrPrefWritable e = it.next();
			coocVectorTemp = e.getVector() ;
			if(coocVectorTemp == null){
				userIDs.add(e.getUserID());
				preferenceValues.add(e.getValue());
			}else{
				coocVector = coocVectorTemp;
			}
		}
		if(coocVector != null){
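			// Note: after the reduce step of the co-occurrence job, each reduce group here contains exactly one vector (one column, or equally one row, of the co-occurrence matrix, since the two are identical).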
			vectorAndPrefsWritable.set(coocVector, userIDs, preferenceValues);
			logger.info(itemID+" "+vectorAndPrefsWritable);
			context.write(itemID, vectorAndPrefsWritable);
		}
	}
}

Step 5: generate the recommendation lists from the output of the combineUserVectorAndCoocMatrixJob job

Mapper of the caclPartialRecomUserVectorJob job: CaclPartialRecomUserVectorMapper

package com.mapper;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.cf.taste.hadoop.item.VectorAndPrefsWritable;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Computes the partial recommendation vectors of each user.
 * @author 曾昭正
 */
public class CaclPartialRecomUserVectorMapper extends Mapper<IntWritable,VectorAndPrefsWritable, VarLongWritable, VectorWritable>{
	private static final Logger logger = LoggerFactory.getLogger(CaclPartialRecomUserVectorMapper.class);
	@Override
	protected void map(IntWritable itemID, VectorAndPrefsWritable values,Context context)
			throws IOException, InterruptedException {
		Vector coocVectorColumn = values.getVector();
		List<Long> userIDs = values.getUserIDs();
		List<Float> preferenceValues = values.getValues();
		for(int i = 0; i< userIDs.size(); i++){
			long userID = userIDs.get(i);
			float preferenceValue = preferenceValues.get(i);
			logger.info("userID:" + userID);
			logger.info("preferenceValue:"+preferenceValue);
			// Multiply this item's co-occurrence column by the user's preference for the item to get the user's partial recommendation scores
			Vector preferenceParScores = coocVectorColumn.times(preferenceValue);
			context.write(new VarLongWritable(userID), new VectorWritable(preferenceParScores));
		}
	}
}
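The mapper above relies on the fact that a matrix-vector product can be split into per-item partial products: for each item a user has rated, scale that item's co-occurrence column by the rating, then add the partial vectors per user. A plain-Java sketch of that idea follows. The matrix is an assumption taken from the coefficients of the worked example in section III, `double` is used where the Mahout code uses `float`, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class PartialProductSketch {

    // Assumption: co-occurrence matrix for items 101..105, read off the worked example in section III.
    static final double[][] COOC = {
        {1, 1, 0, 0, 1},
        {1, 3, 1, 2, 2},
        {0, 1, 1, 1, 0},
        {0, 2, 1, 2, 1},
        {1, 2, 0, 1, 2},
    };

    /**
     * userIDs[i] and prefValues[i] list the users who rated item i and their ratings,
     * mirroring the parallel lists held by VectorAndPrefsWritable.
     */
    static Map<Long, double[]> recommend(double[][] cooc, long[][] userIDs, double[][] prefValues) {
        Map<Long, double[]> totals = new TreeMap<>();
        for (int item = 0; item < cooc.length; item++) {
            for (int k = 0; k < userIDs[item].length; k++) {
                double pref = prefValues[item][k];
                // running total for this user (the combiner and reducer do this summation)
                double[] sum = totals.computeIfAbsent(userIDs[item][k], id -> new double[cooc.length]);
                for (int j = 0; j < cooc.length; j++) {
                    // partial product: column of this item scaled by the user's rating
                    sum[j] += cooc[item][j] * pref;
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // User 1 rated items 102, 103 and 104 with 0.1, 0.2 and 0.3.
        long[][] userIDs = { {}, {1}, {1}, {1}, {} };
        double[][] prefValues = { {}, {0.1}, {0.2}, {0.3}, {} };
        double[] scores = recommend(COOC, userIDs, prefValues).get(1L);
        for (int j = 0; j < scores.length; j++) {
            System.out.println((101 + j) + " " + scores[j]);
        }
    }
}
```

Up to floating-point rounding, user 1's totals are 0.1, 1.1, 0.6, 1.0 and 0.5 for items 101 to 105, the same as the row-by-row computation in section III.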

Combiner of the caclPartialRecomUserVectorJob job: ParRecomUserVectorCombiner

package com.reducer;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
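/**
 * Merges the partial recommendation vectors by adding up the scored co-occurrence vectors of each userID. (Note: this only merges the output of a single map task, so the result is still partial.)
 * @author 曾昭正
 */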
public class ParRecomUserVectorCombiner extends Reducer<VarLongWritable, VectorWritable, VarLongWritable, VectorWritable>{
	private static final Logger logger = LoggerFactory.getLogger(ParRecomUserVectorCombiner.class);
	@Override
	protected void reduce(VarLongWritable userID, Iterable<VectorWritable> coocVectorColunms,Context context)
			throws IOException, InterruptedException {
			
		Vector vectorColunms = null;
		
		for(VectorWritable  coocVectorColunm : coocVectorColunms){
			vectorColunms = vectorColunms == null ? coocVectorColunm.get() : vectorColunms.plus(coocVectorColunm.get());
		}
		logger.info(userID +" " + new VectorWritable(vectorColunms));
		context.write(userID, new VectorWritable(vectorColunms));
	}
}

Reducer of the caclPartialRecomUserVectorJob job: MergeAndGenerateRecommendReducer

package com.reducer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.cf.taste.hadoop.RecommendedItemsWritable;
import org.apache.mahout.cf.taste.impl.recommender.ByValueRecommendedItemComparator;
import org.apache.mahout.cf.taste.impl.recommender.GenericRecommendedItem;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Merges all the scored co-occurrence vectors of a user and generates the recommendation list.
 * @author 曾昭正
 */
public class MergeAndGenerateRecommendReducer extends Reducer<VarLongWritable, VectorWritable, VarLongWritable, RecommendedItemsWritable>{
	private static final Logger logger = LoggerFactory.getLogger(MergeAndGenerateRecommendReducer.class);
	private int recommendationsPerUser;
	@Override
	protected void setup(Context context)
			throws IOException, InterruptedException {
		recommendationsPerUser = context.getConfiguration().getInt("recomandItems.recommendationsPerUser", 5);
	}
	@Override
	protected void reduce(VarLongWritable userID, Iterable<VectorWritable> cooVectorColumn,Context context)
			throws IOException, InterruptedException {
		// Sum the partial score vectors
		Vector recommdVector = null;
		for(VectorWritable vector : cooVectorColumn){
			recommdVector = recommdVector == null ? vector.get() : recommdVector.plus(vector.get());
		}
		// Find the top-M recommendations for each userID (5 by default); the queue is ordered by item score.
		// ByValueRecommendedItemComparator orders items from highest to lowest value, so it is reversed here to keep the smallest element at the head of the PriorityQueue.
		// The capacity is one larger than needed to leave temporary room while a better element is being added.
		Queue<RecommendedItem> topItems = new PriorityQueue<RecommendedItem>(recommendationsPerUser+1, Collections.reverseOrder(ByValueRecommendedItemComparator.getInstance()));
		
		Iterator<Element> it = recommdVector.nonZeroes().iterator();
		while(it.hasNext()){
			Element e = it.next();
			int itemID = e.index();
			float preValue = (float) e.get();
			// While the queue holds fewer items than requested, just add the item and its score
			if(topItems.size() < recommendationsPerUser){
				topItems.add(new GenericRecommendedItem(itemID, preValue));
			}
			// If the current score beats the smallest score in the queue, add the current item and drop the head (the smallest element)
			else if(preValue > topItems.peek().getValue()){
				topItems.add(new GenericRecommendedItem(itemID, preValue));
				// remove the head (the smallest element)
				topItems.poll();
			}
		}
		
		// Sort the queue contents from highest to lowest score
		List<RecommendedItem> recommdations = new ArrayList<RecommendedItem>(topItems.size());
		recommdations.addAll(topItems);// copy the queue contents into the list that will be sorted
		Collections.sort(recommdations,ByValueRecommendedItemComparator.getInstance());// sort
		
		// Emit the recommendation list
		logger.info(userID+" "+ new RecommendedItemsWritable(recommdations));
		context.write(userID, new RecommendedItemsWritable(recommdations));
		
	}
}
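The bounded-queue trick used in the reducer can be tried in isolation. The sketch below keeps the n best scores in a min-heap, so that the head is always the weakest candidate and is the one to evict. `TopNSketch` and `topN` are illustrative names, and plain arrays stand in for Mahout's `RecommendedItem` objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopNSketch {

    /** Returns the item IDs of the n highest scores, best first. */
    static List<Integer> topN(int[] itemIDs, double[] scores, int n) {
        List<Integer> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap of array indices ordered by score: the head is the weakest candidate kept so far.
        Comparator<Integer> byScore = Comparator.comparingDouble(i -> scores[i]);
        PriorityQueue<Integer> heap = new PriorityQueue<>(n + 1, byScore);
        for (int i = 0; i < itemIDs.length; i++) {
            if (heap.size() < n) {
                heap.add(i);
            } else if (scores[i] > scores[heap.peek()]) {
                heap.add(i);  // temporarily n + 1 elements
                heap.poll();  // evict the weakest one
            }
        }
        // A heap iterates in no particular order, so sort the survivors explicitly.
        List<Integer> kept = new ArrayList<>(heap);
        kept.sort(byScore.reversed());
        for (int i : kept) {
            result.add(itemIDs[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] items = {101, 102, 103, 104, 105};
        double[] scores = {0.1, 1.1, 0.6, 1.0, 0.5};
        System.out.println(topN(items, scores, 3));
    }
}
```

With the scores from the worked example and n = 3, the result is items 102, 104 and 103. If the heap were ordered from highest to lowest instead, `peek()` would return the best candidate and the loop would evict the wrong element.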

Step 6: wire the jobs together

PackageRecomendJob

package com.mapreduceMain;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.RecommendedItemsWritable;
import org.apache.mahout.cf.taste.hadoop.item.VectorAndPrefsWritable;
import org.apache.mahout.cf.taste.hadoop.item.VectorOrPrefWritable;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.VectorWritable;
import com.mapper.CaclPartialRecomUserVectorMapper;
import com.mapper.CombineUserVectorAndCoocMatrixMapper;
import com.mapper.UserVecotrSplitMapper;
import com.mapper.UserVectorToCooccurrenceMapper;
import com.mapper.SourceDataToItemPrefsMapper;
import com.reducer.CombineUserVectorAndCoocMatrixReducer;
import com.reducer.MergeAndGenerateRecommendReducer;
import com.reducer.ParRecomUserVectorCombiner;
import com.reducer.UserVectorToCoocurrenceReducer;
import com.reducer.SourceDataToUserVectorReducer;
import com.util.HadoopUtil;


/**
 * Assembles the individual jobs to carry out the whole recommendation pipeline.
 * @author 曾昭正
 */
public class PackageRecomendJob extends Configured implements Tool{
	String[] dataSourceInputPath = {"/user/hadoop/z.zeng/distruteItemCF/dataSourceInput"};
	String[] uesrVectorOutput = {"/user/hadoop/z.zeng/distruteItemCF/uesrVectorOutput/"};
	String[] userVectorSpliltOutPut = {"/user/hadoop/z.zeng/distruteItemCF/userVectorSpliltOutPut"};
	String[] cooccurrenceMatrixOuptPath = {"/user/hadoop/z.zeng/distruteItemCF/CooccurrenceMatrixOuptPath"};
	String[] combineUserVectorAndCoocMatrixOutPutPath = {"/user/hadoop/z.zeng/distruteItemCF/combineUserVectorAndCoocMatrixOutPutPath"};
	String[] caclPartialRecomUserVectorOutPutPath = {"/user/hadoop/z.zeng/distruteItemCF/CaclPartialRecomUserVectorOutPutPath"};
	
	protected void setup(Configuration configuration)
			throws IOException, InterruptedException {
		FileSystem hdfs = FileSystem.get(URI.create("hdfs://cluster-master"), configuration);
		Path p1 = new Path(uesrVectorOutput[0]);
		Path p2 = new Path(userVectorSpliltOutPut[0]);
		Path p3 = new Path(cooccurrenceMatrixOuptPath[0]);
		Path p4 = new Path(combineUserVectorAndCoocMatrixOutPutPath[0]);
		Path p5 = new Path(caclPartialRecomUserVectorOutPutPath[0]);

		if (hdfs.exists(p1)) {
			hdfs.delete(p1, true);
		} 
		if (hdfs.exists(p2)) {
			hdfs.delete(p2, true);
		} 
		if (hdfs.exists(p3)) {
			hdfs.delete(p3, true);
		} 
		if (hdfs.exists(p4)) {
			hdfs.delete(p4, true);
		} 
		if (hdfs.exists(p5)) {
			hdfs.delete(p5, true);
		}
	}
	@Override
	public int run(String[] args) throws Exception {
		  	Configuration conf=getConf(); // get the configuration object
		  	setup(conf);
		  //	DistributedCache.addArchiveToClassPath(new Path("/user/hadoop/z.zeng/distruteItemCF/lib"), conf);
		  // Configure the job that computes the user vectors
		  Job wikipediaToItemPrefsJob = HadoopUtil.prepareJob(
				    "WikipediaToItemPrefsJob",
		  			dataSourceInputPath, 
		  			uesrVectorOutput[0], 
		  			TextInputFormat.class, 
		  			SourceDataToItemPrefsMapper.class, 
		  			VarLongWritable.class, 
		  			VarLongWritable.class, 
		  			SourceDataToUserVectorReducer.class, 
		  			VarLongWritable.class, 
		  			VectorWritable.class, 
		  			SequenceFileOutputFormat.class, 
		  			conf);
		  // Configure the job that computes the co-occurrence vectors
		  Job userVectorToCooccurrenceJob = HadoopUtil.prepareJob(
				    "UserVectorToCooccurrenceJob",
		  			uesrVectorOutput, 
		  			cooccurrenceMatrixOuptPath[0], 
		  			SequenceFileInputFormat.class, 
		  			UserVectorToCooccurrenceMapper.class, 
		  			IntWritable.class, 
		  			IntWritable.class, 
		  			UserVectorToCoocurrenceReducer.class, 
		  			IntWritable.class, 
		  			VectorOrPrefWritable.class, 
		  			SequenceFileOutputFormat.class, 
		  			conf);
		  // Configure the job that splits the user vectors
		  Job userVecotrSplitJob = HadoopUtil.prepareJob(
				    "userVecotrSplitJob",
		  			uesrVectorOutput, 
		  			userVectorSpliltOutPut[0], 
		  			SequenceFileInputFormat.class, 
		  			UserVecotrSplitMapper.class, 
		  			IntWritable.class, 
		  			VectorOrPrefWritable.class, 
		  			SequenceFileOutputFormat.class, 
		  			conf);
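		  // Configure the job that merges the co-occurrence vectors with the split user vectors
		  // Note: the outputs of the split job and the co-occurrence job must both be given as input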
		  String[] combineUserVectorAndCoocMatrixIutPutPath = {cooccurrenceMatrixOuptPath[0],userVectorSpliltOutPut[0]};
		  Job combineUserVectorAndCoocMatrixJob = HadoopUtil.prepareJob(
				    "combineUserVectorAndCoocMatrixJob",
				    combineUserVectorAndCoocMatrixIutPutPath,
		  			combineUserVectorAndCoocMatrixOutPutPath[0], 
		  			SequenceFileInputFormat.class, 
		  			CombineUserVectorAndCoocMatrixMapper.class, 
		  			IntWritable.class, 
		  			VectorOrPrefWritable.class, 
		  			CombineUserVectorAndCoocMatrixReducer.class, 
		  			IntWritable.class, 
		  			VectorAndPrefsWritable.class, 
		  			SequenceFileOutputFormat.class, 
		  			conf);
		  // Configure the job that computes the user recommendation vectors
		  Job caclPartialRecomUserVectorJob= HadoopUtil.prepareJob(
				    "caclPartialRecomUserVectorJob",
				    combineUserVectorAndCoocMatrixOutPutPath,
				    caclPartialRecomUserVectorOutPutPath[0], 
				    SequenceFileInputFormat.class, 
		  			CaclPartialRecomUserVectorMapper.class, 
		  			VarLongWritable.class, 
		  			VectorWritable.class, 
		  			ParRecomUserVectorCombiner.class,// combiner on the map side to reduce network I/O
		  			MergeAndGenerateRecommendReducer.class, 
		  			VarLongWritable.class, 
		  			RecommendedItemsWritable.class, 
		  			TextOutputFormat.class, 
		  			conf);
		  
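		  // Chain the jobs: each one starts only after the previous one has succeeded
		  if(wikipediaToItemPrefsJob.waitForCompletion(true)){
			  if(userVectorToCooccurrenceJob.waitForCompletion(true)){
				  if(userVecotrSplitJob.waitForCompletion(true)){
					  if(combineUserVectorAndCoocMatrixJob.waitForCompletion(true)){
						  // main() passes this value to System.exit, so 0 must mean success
						  int rs = caclPartialRecomUserVectorJob.waitForCompletion(true) ? 0 : 1;
						  return rs;
					  }else{
						  throw new Exception("The job that merges the co-occurrence vectors with the split user vectors failed!");
					  }
				  }else{
					  throw new Exception("The split-user-vector job failed!");
				  }
			  }else{
				  throw new Exception("The co-occurrence vector job failed!");
			  }
		  }else{
			  throw new Exception("The user-vector job failed!");
		  }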
	}
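	public static void main(String[] args) throws IOException,
			ClassNotFoundException, InterruptedException {
		try {
			int returnCode =  ToolRunner.run(new PackageRecomendJob(),args);
			System.exit(returnCode);
		} catch (Exception e) {
			// Do not swallow failures silently: report them and exit with a non-zero code
			e.printStackTrace();
			System.exit(1);
		}
	}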

}
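
The nested `if`/`else` chain in `run()` grows one level deeper with every job. As a rough sketch of an alternative (not the approach used in the listing above), the stages could be kept in an ordered map and run in a loop that stops at the first failure. `JobChain` and the stage names are hypothetical and belong to neither Hadoop nor Mahout; the stages here are plain lambdas so the sketch runs without a cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not a Hadoop or Mahout class. */
final class JobChain {
    // LinkedHashMap keeps insertion order, so stages run in the order they were added.
    private final Map<String, Callable<Boolean>> stages = new LinkedHashMap<>();

    JobChain add(String name, Callable<Boolean> stage) {
        stages.put(name, stage);
        return this;
    }

    /** Runs every stage in order; returns 0 if all succeed, throws at the first failure. */
    int run() throws Exception {
        for (Map.Entry<String, Callable<Boolean>> stage : stages.entrySet()) {
            if (!stage.getValue().call()) {
                throw new IllegalStateException("Job failed: " + stage.getKey());
            }
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in stages; each simply reports success.
        int exitCode = new JobChain()
                .add("userVector", () -> true)
                .add("cooccurrence", () -> true)
                .add("recommend", () -> true)
                .run();
        System.out.println("exit code = " + exitCode);
    }
}
```

With real Hadoop jobs, each stage would presumably be a lambda such as `() -> userVecotrSplitJob.waitForCompletion(true)`, since `Job.waitForCompletion(boolean)` returns a `boolean` and its checked exceptions are compatible with `Callable.call()`.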

5. Summary

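This post covered the basic concepts of the ItemBase recommendation algorithm and how to model existing data for it. The co-occurrence-matrix approach was implemented with MapReduce together with Mahout's built-in data types. After writing this post and finishing the implementation, my main takeaway is that although the MapReduce programming model is very simple, with only two phases (scattering the data and merging it), custom composite data types can be used in both phases to accomplish fairly complex tasks.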
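
To make the scatter-and-merge idea concrete without a cluster, here is a minimal single-process sketch of the co-occurrence approach. It is an illustration under assumptions, not the MapReduce implementation from this post: `CooccurrenceSketch` and its methods are hypothetical names, and plain `TreeMap`s stand in for Mahout's vector types. The matrix is recomputed from the three sample user vectors listed in the data modeling section, so its scores need not match the worked table there.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the co-occurrence pipeline. */
final class CooccurrenceSketch {

    /** Builds one user vector from parallel arrays of item IDs and preference values. */
    static Map<Integer, Double> vector(int[] items, double[] prefs) {
        Map<Integer, Double> v = new TreeMap<>();
        for (int i = 0; i < items.length; i++) {
            v.put(items[i], prefs[i]);
        }
        return v;
    }

    /** "Scatter" every item pair of every user, then "merge" by summing the counts. */
    static Map<Integer, Map<Integer, Integer>> cooccurrence(Map<Long, Map<Integer, Double>> users) {
        Map<Integer, Map<Integer, Integer>> matrix = new TreeMap<>();
        for (Map<Integer, Double> prefs : users.values()) {
            for (int a : prefs.keySet()) {
                for (int b : prefs.keySet()) {
                    matrix.computeIfAbsent(a, k -> new TreeMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
        return matrix;
    }

    /** Multiplies the co-occurrence matrix by one user's preference vector. */
    static Map<Integer, Double> recommend(Map<Integer, Map<Integer, Integer>> matrix,
                                          Map<Integer, Double> prefs) {
        Map<Integer, Double> scores = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Integer>> row : matrix.entrySet()) {
            double sum = 0.0;
            for (Map.Entry<Integer, Integer> cell : row.getValue().entrySet()) {
                // Items the user has not rated contribute 0.
                sum += cell.getValue() * prefs.getOrDefault(cell.getKey(), 0.0);
            }
            scores.put(row.getKey(), sum);
        }
        return scores;
    }

    /** The three sample user vectors from the data modeling section. */
    static Map<Long, Map<Integer, Double>> sampleUsers() {
        Map<Long, Map<Integer, Double>> users = new TreeMap<>();
        users.put(1L, vector(new int[] {102, 103, 104}, new double[] {0.1, 0.2, 0.3}));
        users.put(2L, vector(new int[] {101, 102, 105}, new double[] {0.1, 0.7, 0.9}));
        users.put(3L, vector(new int[] {102, 103, 104}, new double[] {0.1, 0.7, 0.2}));
        return users;
    }

    public static void main(String[] args) {
        Map<Long, Map<Integer, Double>> users = sampleUsers();
        Map<Integer, Map<Integer, Integer>> matrix = cooccurrence(users);
        System.out.println("user 1 scores: " + recommend(matrix, users.get(1L)));
    }
}
```

In the distributed version the two nested loops correspond roughly to the mapper emitting item pairs and the reducer summing them, while the multiplication is split across the merge and partial-recommendation jobs.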

References: 《Mahout in action》, 《推荐引擎实践》

Please credit the source when reposting: http://zengzhaozheng.blog.51cto.com/8219051/1557054
