PHP根据姓名分析男女性别(Python转PHP)

PHP根据姓名分析男女性别
参考:https://github.com/observerss/ngender
统计数据下载:https://download.csdn.net/download/webben/11253847


$name = $argv[1];
$gender = new Gender();
$ret = $gender->guess( $name);
echo $name."\t".$ret[0]."\t".$ret[1];

class Gender
{
    /**
     * @var int 样本总数
     */
    protected $total = 0;

    /**
     * @var int 男性
     */
    protected $male_total = 0;

    /**
     * @var int 女性
     */
    protected $female_total = 0;

    /**
     * @var array 比例
     */
    protected $freq = array();

    protected $sample = './charfreq.csv';

    public function __construct()
    {
        $this->load_model();
    }

    public function guess( $name)
    {
        $name = $this->name_to_array( $this->to_utf8( $name));
        array_shift( $name);
        foreach( $name as $val)
        {
            if( !$this->is_chinese( $val)) {
                throw new Exception('名字必须是中文');
            }
        }
        $pf = $this->prob_for_gender( $name, 0);
        $pm = $this->prob_for_gender( $name, 1);
        if( $pm > $pf){
            return array( 'male', $pm/( $pm+$pf));
        }elseif( $pm < $pf){
            return array( 'female', $pf/( $pm+$pf));
        }
        return array( 'unknown', 0);
    }

    protected function prob_for_gender( $name, $gender = 0)
    {
        $p = ( $gender == 0)
            ? $this->female_total  / $this->total
            : $this->male_total  / $this->total;
        foreach( $name as $val)
        {
            $p *= $this->freq[$val][$gender];
        }
        return $p;
    }

    protected function is_chinese( $name)
    {
        return preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $name) === 1;
    }

    protected function name_to_array( $name)
    {
        preg_match_all("/./u", $name, $matches);
        return $matches[0];
    }

    protected function load_model()
    {
    	if( !file_exists( $this->sample)) throw new Exception('样本数据不存在,请下载:https://download.csdn.net/download/webben/11253847');
        $fp = fopen( $this->sample, 'r');
        fgetcsv( $fp);
        while( ($line = fgets( $fp)) !== false)
        {
            $line = trim( $line);
            list( $char, $male, $female) = explode(',', $line);
            $char = $this->to_utf8( $char);
            $this->male_total += $male;
            $this->female_total += $female;
            $this->freq[$char] = array( $female, $male);
        }
        $this->total = $this->male_total + $this->female_total;
        foreach( $this->freq as $char => $val)
        {
            list( $female, $male) = $val;
            $this->freq[ $char] = array( $female/$this->female_total,
                $male / $this->male_total);
        }
    }

    protected function to_utf8( $name)
    {
        return $name;
    }
}

NGender

根据中文姓名猜测其性别

  • 不到20行纯Python代码(核心部分)
  • 无任何依赖库
  • 82%的准确率
  • 可用于猜测性别
  • 也可用于判断名字的男性化/女性化程度

使用

然后在命令行中

$ ng 赵本山 宋丹丹
name: 赵本山 => gender: male, probability: 0.9836229687547046
name: 宋丹丹 => gender: female, probability: 0.9759486128949907

当然也可以在Python程序中用

php ngender.php 赵本山
('male', 0.9836229687547046)

php ngender.php 宋丹丹
('female', 0.9759486128949907)

原理

数学

贝叶斯公式: P(Y|X) = P(X|Y) * P(Y) / P(X)

当X条件独立时, P(X|Y) = P(X1|Y) * P(X2|Y) * ...

应用到猜名字上

P(gender=男|name=本山) 
= P(name=本山|gender=男) * P(gender=男) / P(name=本山)
= P(name has 本|gender=男) * P(name has 山|gender=男) * P(gender=男) / P(name=本山)

计算

  1. 文件charfreq.csv是怎么来的?

    曾经有个东西叫开房记录.avi(雾),里面有名字和性别, 2000w条, 统计一下得出

  2. 怎么算 P(name has 本|gender=男)?

    “本”在男性名字中出现的次数 / 男性字出现的总次数

  3. 怎么算 P(gender=男)?

    男性名出现的次数 / 总次数

  4. 怎么算 P(name=本山)?

    不用算, 在算概率的时候会互相约去

php ngender.php 李胜男
('male', 0.851334658742)

虽然两个字都很偏男性,但是结合起来就是女性名

你可能感兴趣的:(PHP)