MovieLens Movie Recommendation (Part 1)

Dataset download: https://grouplens.org/datasets/movielens/

Item-based recommendation algorithm

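  1. Load the dataset
  2. Use the Pearson correlation coefficient to find the top k most similar movies for each movie
  3. Given a user id, find the movies that user has rated
  4. Collect the similar movies of those rated movies
  5. For these similar movies, compute the weighted average of the ratings
  6. Take the top k; these are the recommended movies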

1. About the dataset

This post uses the ml-100k data.

This data set consists of:

  • 100,000 ratings (1-5) from 943 users on 1682 movies.
  • Each user has rated at least 20 movies.
  • Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998.

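The dataset includes u.data, u.genre, u.info, u.item, u.occupation, u.user, and other files.

  • u.data: user id, movie id, rating, timestamp
  • u.info: number of users, number of movies, number of ratings
  • u.item: movie id, movie title, release date, video release date, IMDb URL, genres
  • u.genre: list of genres
  • u.user: user id, age, gender, occupation, zip code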
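
To make the field layout concrete, here is a minimal sketch that splits one row of each file. The two rows are illustrative samples written in the documented layout (tab-separated for u.data, pipe-separated for u.item), not text read from disk, and the u.item row is shortened to five fields with a placeholder URL.

```python
# Illustrative rows in the ml-100k layout (assumed samples, not read from disk).
data_row = "196\t242\t3\t881250949\n"
item_row = "1|Toy Story (1995)|01-Jan-1995||http://example.invalid/toy-story\n"

# u.data: user id, movie id, rating, timestamp (tab-separated)
uid, mid, rating, timestamp = data_row.rstrip("\n").split("\t")

# u.item: the movie id and title are the first two pipe-separated fields
movie_id, title = item_row.split("|")[0:2]

print(uid, mid, int(rating), timestamp)
print(movie_id, title)
```

All fields arrive as strings, so the rating has to be converted with int() before any arithmetic.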

2. Reading the data

Start with the {user, movie, rating} triples by reading u.data and u.item.

movieList = {}
with open("data/ml-100k/u.item", encoding='ISO-8859-1') as f:
    for line in f:
        # u.item: the first two fields are the movie id and the movie title
        (movieId, title) = line.split('|')[0:2]
        movieList[movieId] = title

userInfo = {}     # movie title -> {user id: rating}
userInfoUid = {}  # user id -> {movie title: rating}
with open("data/ml-100k/u.data") as f:
    for line in f:
        # u.data: the first three fields are the user id, movie id, and rating
        (uid, mid, rating) = line.split('\t')[0:3]
        if uid not in userInfoUid:
            userInfoUid[uid] = {}
        userInfoUid[uid][movieList[mid]] = int(rating)
        if movieList[mid] not in userInfo:
            userInfo[movieList[mid]] = {}
        userInfo[movieList[mid]][uid] = int(rating)

Printed results:

movieList:

{'263': 'Steel (1997)',
'1419': 'Highlander III: The Sorcerer (1994)',
'1205': 'Secret Agent, The (1996)',
'377': 'Heavyweights (1994)',
'1504': 'Bewegte Mann, Der (1994)',
...
}

userInfoUid:

{'1': {'101 Dalmatians (1996)': 2,
'12 Angry Men (1957)': 5,
'20,000 Leagues Under the Sea (1954)': 3,
'2001: A Space Odyssey (1968)': 4,
'Abyss, The (1989)': 3,
...
}
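
The loading code keeps the same ratings in two dictionaries: one keyed by movie title (userInfo) and one keyed by user id (userInfoUid). The sketch below builds both shapes from a few made-up triples; the names by_movie and by_user and the sample data are hypothetical, and collections.defaultdict is used here only to shorten the "create the inner dict if missing" step.

```python
from collections import defaultdict

# Hypothetical (user id, movie title, rating) triples, for illustration only.
triples = [
    ("1", "Movie A", 5),
    ("1", "Movie B", 3),
    ("2", "Movie A", 4),
]

by_movie = defaultdict(dict)  # plays the role of userInfo
by_user = defaultdict(dict)   # plays the role of userInfoUid
for uid, title, rating in triples:
    by_movie[title][uid] = rating
    by_user[uid][title] = rating

print(dict(by_movie))
print(dict(by_user))
```

The movie-keyed shape is what the similarity computation needs, while the user-keyed shape is what the recommendation step looks up.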

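3. Computing the distance

Pearson correlation coefficient: measures how related two movies are, based on the ratings people gave to each of them. It is a good fit here for two reasons:

  1. People rate on different scales; some rate consistently low, others consistently high
  2. Ratings that differ in absolute value but follow the same trend are still treated as similar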
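
A small sketch of the second point, using made-up rating lists: two lists that move together score 1 even though one is two points higher throughout, and a reversed trend scores -1. The helper pearson_lists is a hypothetical stand-alone function written for this illustration, using the mean-centered form of the coefficient.

```python
import math

def pearson_lists(xs, ys):
    """Pearson correlation of two equal-length rating lists (0.0 if undefined)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    den = math.sqrt(var_x * var_y)
    return cov / den if den else 0.0

harsh = [1, 2, 3, 2]     # hypothetical ratings on a strict scale
generous = [3, 4, 5, 4]  # same trend, two points higher
opposite = [5, 4, 3, 4]  # reversed trend

print(pearson_lists(harsh, generous))
print(pearson_lists(harsh, opposite))
```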

import numpy as np

def pearson(data, movie1, movie2):
    # Users who rated both movies
    personList = [person for person in data[movie1] if person in data[movie2]]
    pLen = len(personList)
    if pLen == 0:
        return 0
    # print('movie1:',movie1)
    # print('movie2:',movie2)
    # print('personList:',personList)
    # print('pLen:',pLen)
    rating1 = [data[movie1][p] for p in personList]
    rating2 = [data[movie2][p] for p in personList]
    ratingSq1 = [data[movie1][p]**2 for p in personList]
    ratingSq2 = [data[movie2][p]**2 for p in personList]
    # Sums of ratings, sums of squared ratings, and sum of products
    sum1 = sum(rating1)
    sum2 = sum(rating2)
    sumSq1 = sum(ratingSq1)
    sumSq2 = sum(ratingSq2)
    psum = sum([data[movie1][p] * data[movie2][p] for p in personList])
    # Pearson correlation coefficient
    num = psum - (sum1 * sum2) / pLen
    den = np.sqrt((sumSq1 - np.square(sum1)/pLen) * (sumSq2 - np.square(sum2)/pLen))
    # A zero denominator means at least one movie has no variance among the shared raters
    if den == 0:
        return 0
    return num/den
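
For reference, the value that pearson returns can be written out as a formula. The symbols are introduced here for illustration: x_p and y_p are the ratings user p gave to movie1 and movie2, and n is the number of users who rated both. They correspond to sum1, sum2, sumSq1, sumSq2, psum, and pLen in the code.

```latex
r = \frac{\sum_{p} x_p y_p - \frac{1}{n}\sum_{p} x_p \sum_{p} y_p}
         {\sqrt{\left(\sum_{p} x_p^2 - \frac{1}{n}\Bigl(\sum_{p} x_p\Bigr)^2\right)
                \left(\sum_{p} y_p^2 - \frac{1}{n}\Bigl(\sum_{p} y_p\Bigr)^2\right)}}
```

When either factor under the square root is zero, the coefficient is undefined and the function returns 0 instead.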

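4. Finding the top k most similar movies for a movie

Compute the Pearson coefficient between this movie and every other movie, sort the results, and take the top k.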

def topRating(data, movie, k=5):
    # Pearson correlation coefficient between this movie and every other movie
    scores = {}
    for mov in data.keys():
        if mov != movie:
            scores[mov] = pearson(data, movie, mov)
    # Sort (title, coefficient) pairs by coefficient, highest first
    scoSorted = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # print('movie {0}, scoSorted: top {1}, {2}'.format(movie, k, scoSorted[:k]))
    return scoSorted[:k]
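
Since only the first k entries are kept, a full sort is not strictly required. As a possible alternative, the standard library's heapq.nlargest selects the top k directly; the scores below are made up for illustration.

```python
import heapq

# Hypothetical similarity scores for one movie (title -> coefficient).
scores = {"Movie A": 0.9, "Movie B": -0.2, "Movie C": 0.5, "Movie D": 0.7}

# Same result as sorting by coefficient in descending order and slicing [:2].
top2 = heapq.nlargest(2, scores.items(), key=lambda item: item[1])
print(top2)
```

Whether this is noticeably faster depends on k and on the number of movies; with 1682 movies, the Pearson computation is likely to dominate either way.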

5. Getting the similar movies for every movie

Iterate over every movie and compute its similar movies.

def getMovieList(data):
    matchMovieList = {}
    for mov in data.keys():
        matchMovieList[mov] = topRating(data, mov, 5)
    return matchMovieList
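
Because getMovieList calls topRating once per movie, pearson is evaluated for every ordered pair, so each pair of movies is computed twice. The coefficient is symmetric, so one option is to visit each unordered pair once. The sketch below is an assumption-laden illustration, not the approach used in this post: all_top_k and shared_raters are hypothetical names, and shared_raters is a toy stand-in for pearson.

```python
from itertools import combinations

def all_top_k(data, similarity, k=5):
    """Top-k neighbours per item, evaluating each unordered pair once."""
    neighbours = {item: [] for item in data}
    for a, b in combinations(data, 2):
        s = similarity(data, a, b)  # assumed symmetric: sim(a, b) == sim(b, a)
        neighbours[a].append((b, s))
        neighbours[b].append((a, s))
    return {item: sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
            for item, pairs in neighbours.items()}

# Toy stand-in for pearson: the number of users who rated both items.
def shared_raters(data, a, b):
    return len(set(data[a]) & set(data[b]))

toy = {"A": {"1": 5, "2": 4}, "B": {"1": 3}, "C": {"2": 2, "3": 1}}
result = all_top_k(toy, shared_raters, k=1)
print(result)
```

The trade-off is memory: every pair's score is held until the final sort, and the order of tied neighbours can differ from what topRating produces.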

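6. Recommending movies

Procedure:

  1. Get all the movies the user has rated
  2. Iterate over each of those movies
  3. Get the movie's similar movies
  4. Check whether each similar movie has already been rated by the user, and skip it if so
  5. Accumulate correlation coefficient × rating
  6. Accumulate the correlation coefficients
  7. Compute the weighted average rating
  8. The top k by score are the recommended movies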

def getRecommendMov(data, matchmov, userid, k=5):
    try:
        userRating = data[userid]
    except KeyError:
        print('No User')
        return []
    scores = {}    # weighted sum: correlation coefficient * rating
    totalSco = {}  # sum of correlation coefficients (the weights)
    # Every movie the user has rated
    for mov, rating in userRating.items():
        # Every similar movie of the current movie
        for nearMov, nearPear in matchmov[mov]:
            # Skip movies the user has already rated
            if nearMov in userRating:
                continue
            # Start both accumulators at 0 so the first contribution is counted once
            if nearMov not in scores:
                scores[nearMov] = 0
                totalSco[nearMov] = 0
            scores[nearMov] += nearPear * rating
            totalSco[nearMov] += nearPear
    rankings = [(scores[nearMov]/totalSco[nearMov], nearMov) for nearMov in scores if totalSco[nearMov] != 0]
    rankings.sort(key=lambda x: x[0], reverse=True)
    # Slicing avoids an IndexError when fewer than k candidates exist
    recommendMov = [mov for _, mov in rankings[:k]]
    return recommendMov
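
A worked example of the weighted average for a single candidate movie, with made-up numbers: suppose three movies the user rated each list the candidate as similar, with the coefficients and ratings below. The variable names are hypothetical.

```python
# (correlation coefficient, the user's rating of the movie that listed the
# candidate as similar); hypothetical values for illustration only.
contributions = [(0.9, 5), (0.6, 3), (0.5, 4)]

weighted_sum = sum(sim * rating for sim, rating in contributions)
weight_total = sum(sim for sim, _ in contributions)
predicted = weighted_sum / weight_total
print(round(predicted, 2))
```

Here the weighted sum is 4.5 + 1.8 + 2.0 = 8.3 and the weights add up to 2.0, so the predicted score is 4.15: closer to the rating attached to the strongest coefficient than a plain average of 4.0 would be.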

7. Output

Enter a user id to get 5 recommended movies. Enter exit to quit.

➜ movieLens git:(master) ✗ python movieLens.py
input userid:1
near: ['Best of the Best 3: No Turning Back (1995)', 'Senseless (1998)', 'Shadow, The (1994)', 'Turbulence (1997)', 'Fear of a Black Hat (1993)']
input userid:12
near: ['Wild Bill (1995)', 'Year of the Horse (1997)', 'Telling Lies in America (1997)', 'In Love and War (1996)', 'Metisse (Café au Lait) (1993)']
input userid:13
near: ['Caro Diario (Dear Diary) (1994)', 'Even Cowgirls Get the Blues (1993)', 'Senseless (1998)', 'Love Serenade (1996)', 'Herbie Rides Again (1974)']
input userid:exit
➜ movieLens git:(master) ✗

8. Miscellaneous

8.1. Decoding error when reading u.item

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

Hunting down the offending bytes in the original file and deleting them is hardly a reasonable option.

Fix:

for line in open("data/ml-100k/u.item", encoding='ISO-8859-1').readlines():

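Add encoding='ISO-8859-1' to the open() call. u.item is not UTF-8 encoded; byte 0xe9 is 'é' in ISO-8859-1, which appears in titles such as 'Metisse (Café au Lait) (1993)'.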
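
The failure and the fix can be reproduced without the dataset. The sketch below decodes a byte string containing 0xe9 with both codecs; the byte string is constructed by hand for illustration rather than read from u.item.

```python
raw = b"Metisse (Caf\xe9 au Lait) (1993)"  # 0xe9 is 'é' in ISO-8859-1

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # In UTF-8, 0xe9 starts a multi-byte sequence, but a space follows it
    utf8_ok = False

title = raw.decode("ISO-8859-1")
print(utf8_ok, title)
```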

8.2. sys.setdefaultencoding('utf-8') raises an error

AttributeError: module 'sys' has no attribute 'setdefaultencoding'

Python 3 does not provide this function; the default string encoding is always 'utf-8'.

This can be confirmed with sys.getdefaultencoding():

print(sys.getdefaultencoding())
➜ movieLens git:(master) ✗ python movieLens.py
utf-8

9. Full code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import numpy as np

def loadTrainSet():
    movieList = {}
    with open("data/ml-100k/u.item", encoding='ISO-8859-1') as f:
        for line in f:
            # u.item: the first two fields are the movie id and the movie title
            (movieId, title) = line.split('|')[0:2]
            movieList[movieId] = title

    userInfo = {}     # movie title -> {user id: rating}
    userInfoUid = {}  # user id -> {movie title: rating}
    with open("data/ml-100k/u.data") as f:
        for line in f:
            # u.data: the first three fields are the user id, movie id, and rating
            (uid, mid, rating) = line.split('\t')[0:3]
            if uid not in userInfoUid:
                userInfoUid[uid] = {}
            userInfoUid[uid][movieList[mid]] = int(rating)

            if movieList[mid] not in userInfo:
                userInfo[movieList[mid]] = {}
            userInfo[movieList[mid]][uid] = int(rating)
    return movieList, userInfo, userInfoUid

# Distance computation
# Pearson correlation coefficient:
def pearson(data, movie1, movie2):
    personList = [person for person in data[movie1].keys() if person in data[movie2].keys()]
    pLen = len(personList)    
    if pLen == 0:
        return 0
    # print('movie1:',movie1)
    # print('movie2:',movie2)
    # print('personList:',personList)
    # print('pLen:',pLen)

    rating1 = [data[movie1][p] for p in personList]
    rating2 = [data[movie2][p] for p in personList]
    ratingSq1 = [data[movie1][p]**2 for p in personList]
    ratingSq2 = [data[movie2][p]**2 for p in personList]    

    # Sums of ratings, sums of squared ratings, and sum of products
    sum1 = sum(rating1)
    sum2 = sum(rating2)
    sumSq1 = sum(ratingSq1)
    sumSq2 = sum(ratingSq2)
    psum = sum([data[movie1][p] * data[movie2][p] for p in personList])

    # Pearson correlation coefficient
    num = psum - (sum1 * sum2) / pLen
    den = np.sqrt((sumSq1 - np.square(sum1)/pLen) * (sumSq2 - np.square(sum2)/pLen))

    if den == 0:
        return 0
    return num/den

def topRating(data, movie, k = 5):
    # Pearson correlation coefficient between this movie and every other movie
    scores = {}
    for mov in data.keys():
        if mov != movie:
            scores[mov] = pearson(data, movie, mov)
    scoSorted = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # print('movie {0}, scoSorted: top {1}, {2}'.format(movie, k, scoSorted[:k]))
    return scoSorted[:k]

def getMovieList(data):
    matchMovieList = {}
    for mov in data.keys():
        matchMovieList[mov] = topRating(data, mov, 5)
    return matchMovieList

def getRecommendMov(data, matchmov, userid, k=5):
    try:
        userRating = data[userid]
    except KeyError:
        print('No User')
        return []
    scores = {}    # weighted sum: correlation coefficient * rating
    totalSco = {}  # sum of correlation coefficients (the weights)

    # Every movie the user has rated
    for mov, rating in userRating.items():
        # Every similar movie of the current movie
        for nearMov, nearPear in matchmov[mov]:
            # Skip movies the user has already rated
            if nearMov in userRating:
                continue
            # Start both accumulators at 0 so the first contribution is counted once
            if nearMov not in scores:
                scores[nearMov] = 0
                totalSco[nearMov] = 0
            scores[nearMov] += nearPear * rating
            totalSco[nearMov] += nearPear

    rankings = [(scores[nearMov]/totalSco[nearMov], nearMov) for nearMov in scores if totalSco[nearMov] != 0]
    rankings.sort(key=lambda x: x[0], reverse=True)
    # Slicing avoids an IndexError when fewer than k candidates exist
    recommendMov = [mov for _, mov in rankings[:k]]
    return recommendMov

def movielensClass():
    movieList,userInfo,userInfoUid = loadTrainSet()
    matchmov = getMovieList(userInfo)
    return matchmov,userInfo,userInfoUid


if __name__ == '__main__':
    matchmov,userInfo,userInfoUid = movielensClass()
    while True:
        userid = input("input userid:")
        if userid == 'exit':
            break
        else:
            near = getRecommendMov(userInfoUid, matchmov, userid)
            print('near:',near)