2017-09-14

MovieLens Movie Recommendation (Part 1)

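Dataset download: https://grouplens.org/datasets/movielens/

Item-based recommendation algorithm:

Load the data set
Use the Pearson correlation coefficient to find the top k most similar movies for each movie
Given a user id, find the movies that user has rated
Collect the similar movies of those rated movies
For each of these similar movies, compute the similarity-weighted average rating
Take the top k; these are the recommended movies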

1. About the data set

This post uses the ml-100k data.

This data set consists of:

100,000 ratings (1-5) from 943 users on 1682 movies.
Each user has rated at least 20 movies.
Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998.

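The data set contains u.data, u.genre, u.info, u.item, u.occupation, u.user, and other files.

u.data: user id, movie id, rating, timestamp
u.info: number of users, number of movies, number of ratings
u.item: movie id, movie title, release date, video release date, IMDb URL, genres
u.genre: genres
u.user: user id, age, gender, occupation, zip code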
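To make the file layouts concrete, here is a minimal sketch that parses one row of each kind from in-memory strings. The sample rows, the record types `Rating` and `User`, and the helper `parse_rows` are illustrative assumptions, not part of the data set or of the code in this post; the delimiters (tab for u.data, `|` for u.user) follow the layouts listed above.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record types; field names mirror the column lists above.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])
User = namedtuple("User", ["user_id", "age", "gender", "occupation", "zip_code"])

# Illustrative rows in the described layouts (values are made up for the example).
sample_data = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"
sample_user = "1|24|M|technician|85711\n"

def parse_rows(text, delimiter, record_type):
    """Split delimited text into records of the given type, skipping empty lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [record_type(*row) for row in reader if row]

ratings = parse_rows(sample_data, "\t", Rating)
users = parse_rows(sample_user, "|", User)
print(ratings[0].movie_id, users[0].occupation)
```

All fields come back as strings, so ratings still need an explicit `int(...)` before any arithmetic.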

2. Reading the data

Start with the {user, movie, rating} triples: read u.data and u.item.

# Map movie id -> title. u.item is not valid UTF-8, so decode it as ISO-8859-1.
movieList = {}
with open("data/ml-100k/u.item", encoding='ISO-8859-1') as f:
	for line in f:
		# u.item starts with: movie id | movie title | ...
		(movieId, title) = line.split('|')[0:2]
		movieList[movieId] = title
userInfo = {}     # movie title -> {user id: rating}
userInfoUid = {}  # user id -> {movie title: rating}
with open("data/ml-100k/u.data") as f:
	for line in f:
		# u.data columns: user id, movie id, rating, timestamp (tab-separated)
		(uid, mid, rating) = line.split('\t')[0:3]
		title = movieList[mid]
		userInfoUid.setdefault(uid, {})[title] = int(rating)
		userInfo.setdefault(title, {})[uid] = int(rating)

Printed results:

movieList:

{'263': 'Steel (1997)',
 '1419': 'Highlander III: The Sorcerer (1994)',
 '1205': 'Secret Agent, The (1996)',
 '377': 'Heavyweights (1994)',
 '1504': 'Bewegte Mann, Der (1994)',
 ...
}

userInfoUid:

{'1': {'101 Dalmatians (1996)': 2,
  '12 Angry Men (1957)': 5,
  '20,000 Leagues Under the Sea (1954)': 3,
  '2001: A Space Odyssey (1968)': 4,
  'Abyss, The (1989)': 3,
  ...
}

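3. Computing the distance

Pearson correlation coefficient: measures how related two movies are, based on the ratings people gave each of them.

People rate on different scales; some score consistently low, others consistently high.
Even when the absolute scores differ, two sets of ratings that follow the same trend are treated as similar.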
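For reference, the coefficient can be written in its "sum" form, which is the form the pearson function in this section evaluates. The symbols are introduced here for illustration: x_i and y_i are the ratings that the i-th common user gave to the two movies, and n is the number of users who rated both.

```latex
r = \frac{\sum_i x_i y_i - \frac{1}{n}\sum_i x_i \sum_i y_i}
         {\sqrt{\left(\sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2\right)
                \left(\sum_i y_i^2 - \frac{1}{n}\left(\sum_i y_i\right)^2\right)}}
```

When the denominator is zero (for example, one movie received the same rating from every common user), r is undefined; the code returns 0 in that case.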

def pearson(data, movie1, movie2):
	personList = [person for person in data[movie1].keys() if person in data[movie2].keys()]
	pLen = len(personList)	
	if pLen == 0:
		return 0
	# print('movie1:',movie1)
	# print('movie2:',movie2)
	# print('personList:',personList)
	# print('pLen:',pLen)
	rating1 = [data[movie1][p] for p in personList]
	rating2 = [data[movie2][p] for p in personList]
	ratingSq1 = [data[movie1][p]**2 for p in personList]
	ratingSq2 = [data[movie2][p]**2 for p in personList]	
	# Sums of ratings, sums of squared ratings, and sum of products
	sum1 = sum(rating1)
	sum2 = sum(rating2)
	sumSq1 = sum(ratingSq1)
	sumSq2 = sum(ratingSq2)
	psum = sum([data[movie1][p] * data[movie2][p] for p in personList])
	# Pearson correlation coefficient
	num = psum - (sum1 * sum2) / pLen
	den = np.sqrt((sumSq1 - np.square(sum1)/pLen) * (sumSq2 - np.square(sum2)/pLen))
	if den == 0:
		return 0
	return num/den
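A quick way to see the "same trend, different scale" property is to run the same formula on plain lists. The sketch below is a standalone re-statement of the computation above that uses `math.sqrt` instead of NumPy; `pearson_lists` and the sample ratings are hypothetical names and values chosen for illustration.

```python
import math

def pearson_lists(xs, ys):
    """Pearson correlation of two equal-length rating lists (sum form)."""
    n = len(xs)
    if n == 0:
        return 0.0
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    num = sum_xy - sum_x * sum_y / n
    den = math.sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    # A zero denominator means one list has no variance; treat it as "no correlation".
    return num / den if den else 0.0

movie_a = [1, 2, 3, 4]          # ratings from four common users
movie_b = [2, 3, 4, 5]          # same trend, one point higher
same_trend = pearson_lists(movie_a, movie_b)
opposite = pearson_lists(movie_a, movie_b[::-1])
flat = pearson_lists(movie_a, [3, 3, 3, 3])
print(same_trend, opposite, flat)
```

Shifting every rating by a constant leaves the coefficient at 1, reversing the order gives -1, and a constant list falls into the zero-denominator branch.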

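4. Finding the top k most similar movies for a movie

Compute the Pearson coefficient between this movie and every other movie, sort the results, and take the top k.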

def topRating(data, movie, k = 5):
	# Pearson correlation between this movie and every other movie
	scores = {}
	for mov in data.keys():
		if mov != movie:
			scores[mov] = pearson(data, movie, mov)
	# Sort (movie, score) pairs by score, highest first
	scoSorted = sorted(scores.items(), key=lambda item: item[1], reverse=True)
	# print('movie {0}, scoSorted: top {1}, {2}'.format(movie, k, scoSorted[:k]))
	return scoSorted[:k]
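Since only the first k entries are kept, a full sort is not strictly needed. As an alternative sketch (not what the code above does), the standard library's `heapq.nlargest` selects the k best pairs directly; `top_k` and the sample scores are made-up names and values.

```python
import heapq

def top_k(scores, k=5):
    """Return the k (name, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

scores = {"A": 0.2, "B": 0.9, "C": -0.4, "D": 0.7}
best = top_k(scores, 2)
print(best)
```

For small k relative to the number of movies this avoids ordering the entries that are thrown away; the result is the same as sorting in descending order and slicing.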

5. Getting the similar movies for every movie

Loop over every movie and compute its similar movies.

def getMovieList(data):
	matchMovieList = {}
	for mov in data.keys():
		matchMovieList[mov] = topRating(data, mov, 5)
	return matchMovieList
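This step compares every movie with every other movie, so it dominates the running time. Because the Pearson coefficient is symmetric (up to floating-point rounding), one possible optimization is to evaluate each unordered pair once and record the score for both movies. The sketch below assumes a generic `similarity(a, b)` callable; `all_neighbors`, `toy_similarity`, and `call_log` are hypothetical names used only for this illustration.

```python
from itertools import combinations

def all_neighbors(items, similarity, k=5):
    """Top-k neighbours per item, calling `similarity` once per unordered pair."""
    scores = {item: [] for item in items}
    for a, b in combinations(items, 2):
        s = similarity(a, b)
        scores[a].append((b, s))
        scores[b].append((a, s))
    return {
        item: sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
        for item, pairs in scores.items()
    }

call_log = []  # records each pair actually evaluated

def toy_similarity(a, b):
    """Stand-in similarity: numbers that are closer together score higher."""
    call_log.append((a, b))
    return -abs(a - b)

neighbors = all_neighbors([1, 2, 4, 8], toy_similarity, k=2)
print(neighbors[1], len(call_log))
```

With 4 items this makes 6 similarity calls instead of 12; the trade-off is that all pair scores are held in memory until the final top-k selection.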

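6. Recommending movies:

Steps:

Get all movies the user has rated
Loop over each of those movies
Get the movie's similar movies
Skip any similar movie the user has already rated
Accumulate similarity * rating
Accumulate the similarity
Compute the weighted average rating
The top k by score are the recommended movies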
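The weighted average in these steps is the sum of similarity * rating divided by the sum of similarities. A small worked example with made-up numbers (`weighted_average` is a hypothetical helper, separate from the recommendation function):

```python
def weighted_average(pairs):
    """pairs: (similarity, rating) tuples collected for one candidate movie."""
    total_weight = sum(sim for sim, _ in pairs)
    if total_weight == 0:
        return None  # no usable evidence for this candidate
    return sum(sim * rating for sim, rating in pairs) / total_weight

# The candidate is similar to three movies the user rated 4, 5 and 2.
pairs = [(0.5, 4), (1.0, 5), (0.5, 2)]
predicted = weighted_average(pairs)
print(predicted)  # (2 + 5 + 1) / 2 = 4.0
```

The rating of 5 counts twice as much as the others because its movie is twice as similar to the candidate.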

def getRecommendMov(data, matchmov, userid, k=5):
	try:
		userRating = data[userid]
	except KeyError:
		print('No User')
		return []
	scores = {}    # sum of similarity * rating per candidate movie
	totalSco = {}  # sum of similarities per candidate movie
	# Every movie the user has rated
	for mov, rating in userRating.items():
		# Every movie similar to the current one
		for nearMov, nearPear in matchmov[mov]:
			# Skip movies the user has already rated
			if nearMov in userRating:
				continue
			# Start from zero so the first contribution is counted only once
			scores[nearMov] = scores.get(nearMov, 0) + nearPear * rating
			totalSco[nearMov] = totalSco.get(nearMov, 0) + nearPear
	rankings = [(scores[nearMov]/totalSco[nearMov], nearMov) for nearMov in scores if totalSco[nearMov] != 0]
	rankings.sort(key=lambda x: x[0], reverse=True)
	# Slicing avoids an IndexError when fewer than k candidates exist
	return [nearMov for _, nearMov in rankings[:k]]

7. Output

Enter a user id to get 5 recommended movies. Enter exit to quit.

➜  movieLens git:(master) ✗ python movieLens.py
input userid:1
near: ['Best of the Best 3: No Turning Back (1995)', 'Senseless (1998)', 'Shadow, The (1994)', 'Turbulence (1997)', 'Fear of a Black Hat (1993)']
input userid:12
near: ['Wild Bill (1995)', 'Year of the Horse (1997)', 'Telling Lies in America (1997)', 'In Love and War (1996)', 'Metisse (Café au Lait) (1993)']
input userid:13
near: ['Caro Diario (Dear Diary) (1994)', 'Even Cowgirls Get the Blues (1993)', 'Senseless (1998)', 'Love Serenade (1996)', 'Herbie Rides Again (1974)']
input userid:exit
➜  movieLens git:(master) ✗

8. Other notes

8.1. Decoding error when reading u.item

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

Hunting down and deleting the offending bytes in the original file is not a reasonable option.

Fix:

for line in open("data/ml-100k/u.item", encoding='ISO-8859-1').readlines():

Add encoding='ISO-8859-1'.
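The error can be reproduced without the data file. In ISO-8859-1 the byte 0xe9 is the single character é, while in UTF-8 a lone 0xe9 is an incomplete multi-byte sequence. The byte string below is a hand-written example, not an excerpt from u.item.

```python
raw = b"Caf\xe9 au Lait"  # "é" stored as the single byte 0xe9

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # 0xe9 starts a multi-byte sequence that never completes

text = raw.decode("ISO-8859-1")  # every byte maps to exactly one character
print(utf8_ok, text)
```

ISO-8859-1 can decode any byte sequence, so it never raises here; that only yields the right text when the file really was written in that encoding.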

8.2. sys.setdefaultencoding('utf-8') raises an error

AttributeError: module 'sys' has no attribute 'setdefaultencoding'

Python 3 does not have this function. The default string encoding is already 'utf-8'.

This can be confirmed with sys.getdefaultencoding():

print(sys.getdefaultencoding())
➜  movieLens git:(master) ✗ python movieLens.py
utf-8
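Both observations can be checked in a couple of lines on Python 3. The variable names are illustrative.

```python
import sys

has_setter = hasattr(sys, "setdefaultencoding")  # False on Python 3
default_encoding = sys.getdefaultencoding()      # 'utf-8' on Python 3
print(has_setter, default_encoding)
```

Note that in most Python 3 versions `open()` takes its default encoding from the locale rather than from `sys.getdefaultencoding()`, which is one more reason to pass `encoding=` explicitly as in 8.1.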

9. Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import numpy as np

def loadTrainSet():
    # Map movie id -> title. u.item is not valid UTF-8, so decode it as ISO-8859-1.
    movieList = {}
    with open("data/ml-100k/u.item", encoding='ISO-8859-1') as f:
        for line in f:
            # u.item starts with: movie id | movie title | ...
            (movieId, title) = line.split('|')[0:2]
            movieList[movieId] = title

    userInfo = {}     # movie title -> {user id: rating}
    userInfoUid = {}  # user id -> {movie title: rating}
    with open("data/ml-100k/u.data") as f:
        for line in f:
            # u.data columns: user id, movie id, rating, timestamp (tab-separated)
            (uid, mid, rating) = line.split('\t')[0:3]
            title = movieList[mid]
            userInfoUid.setdefault(uid, {})[title] = int(rating)
            userInfo.setdefault(title, {})[uid] = int(rating)
    return movieList, userInfo, userInfoUid

# Distance computation
# Pearson correlation coefficient:
def pearson(data, movie1, movie2):
    personList = [person for person in data[movie1].keys() if person in data[movie2].keys()]
    pLen = len(personList)    
    if pLen == 0:
        return 0
    # print('movie1:',movie1)
    # print('movie2:',movie2)
    # print('personList:',personList)
    # print('pLen:',pLen)

    rating1 = [data[movie1][p] for p in personList]
    rating2 = [data[movie2][p] for p in personList]
    ratingSq1 = [data[movie1][p]**2 for p in personList]
    ratingSq2 = [data[movie2][p]**2 for p in personList]    

    # Sums of ratings, sums of squared ratings, and sum of products
    sum1 = sum(rating1)
    sum2 = sum(rating2)
    sumSq1 = sum(ratingSq1)
    sumSq2 = sum(ratingSq2)
    psum = sum([data[movie1][p] * data[movie2][p] for p in personList])

    # Pearson correlation coefficient
    num = psum - (sum1 * sum2) / pLen
    den = np.sqrt((sumSq1 - np.square(sum1)/pLen) * (sumSq2 - np.square(sum2)/pLen))

    if den == 0:
        return 0
    return num/den

def topRating(data, movie, k = 5):
    # Pearson correlation between this movie and every other movie
    scores = {}
    for mov in data.keys():
        if mov != movie:
            scores[mov] = pearson(data, movie, mov)
    # Sort (movie, score) pairs by score, highest first
    scoSorted = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # print('movie {0}, scoSorted: top {1}, {2}'.format(movie, k, scoSorted[:k]))
    return scoSorted[:k]

def getMovieList(data):
    matchMovieList = {}
    for mov in data.keys():
        matchMovieList[mov] = topRating(data, mov, 5)
    return matchMovieList

def getRecommendMov(data, matchmov, userid, k=5):
    try:
        userRating = data[userid]
    except KeyError:
        print('No User')
        return []
    scores = {}    # sum of similarity * rating per candidate movie
    totalSco = {}  # sum of similarities per candidate movie

    # Every movie the user has rated
    for mov, rating in userRating.items():
        # Every movie similar to the current one
        for nearMov, nearPear in matchmov[mov]:
            # Skip movies the user has already rated
            if nearMov in userRating:
                continue
            # Start from zero so the first contribution is counted only once
            scores[nearMov] = scores.get(nearMov, 0) + nearPear * rating
            totalSco[nearMov] = totalSco.get(nearMov, 0) + nearPear

    rankings = [(scores[nearMov]/totalSco[nearMov], nearMov) for nearMov in scores if totalSco[nearMov] != 0]
    rankings.sort(key=lambda x: x[0], reverse=True)
    # Slicing avoids an IndexError when fewer than k candidates exist
    return [nearMov for _, nearMov in rankings[:k]]

def movielensClass():
    movieList,userInfo,userInfoUid = loadTrainSet()
    matchmov = getMovieList(userInfo)
    return matchmov,userInfo,userInfoUid


if __name__ == '__main__':
    matchmov,userInfo,userInfoUid = movielensClass()
    while True:
        userid = input("input userid:")
        if userid == 'exit':
            break
        else:
            near = getRecommendMov(userInfoUid, matchmov, userid)
            print('near:',near)