Python Mini-Exercise: Preprocessing the TMDB Movie Dataset

Load the TMDB dataset and preprocess the data

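TMDb is a movie database. According to the dataset description, it contains basic information on nearly 11,000 movies released between 1960 and 2016, mainly genre, budget, box office revenue, cast and crew, runtime, and ratings. It is used here to practice data analysis. Note that the tmdb_5000 files loaded below contain 4,803 rows, as the info() output shows.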

Reference article: https://blog.csdn.net/moyue1002/article/details/80332186
Python 3.7
pandas 0.23
numpy 1.18
matplotlib 2.2

import pandas as pd

credits = pd.read_csv("./tmdb_5000_credits.csv")
movies = pd.read_csv("./tmdb_5000_movies.csv")
View general information about each DataFrame
# Information about the movies table
movies.head(1)

Out[3]: 
      budget                                             genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800

 

 

Information about the credits table

print(credits.info())
credits.head(1)

Out[4]: 
    <class "pandas.core.frame.DataFrame">
    RangeIndex: 4803 entries, 0 to 4802
    Data columns (total 4 columns):
    movie_id    4803 non-null int64
    title       4803 non-null object
    cast        4803 non-null object
    crew        4803 non-null object
    dtypes: int64(1), object(3)
    memory usage: 150.2+ KB
    None

   movie_id                        ...                                                                       crew
0     19995                        ...                          [{"credit_id": "52fe48009251416c750aca23", "de...

 

 

The cast column of the credits table looks odd and holds a lot of data.
Let's take a closer look.

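# Look at the value at index 0 of the cast column in the credits table; it turns out to be one long string
print("cast type:", type(credits["cast"][0]))  # the type is `str`, so it cannot be processed directly
Out[5]:
    cast type: <class "str">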

 

 

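Processing the JSON-formatted data

As the table shows, the cast column actually holds JSON-formatted data, so it should be processed with the json package.
The JSON has the form [{},{}], that is, a list of objects.
Use json.loads() to convert a JSON-formatted string into a Python object.
json.load() works on files: it reads JSON from a file object.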
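
The difference between the two functions can be seen in a minimal sketch. The sample string below is hypothetical (shaped like one cell of the cast column), and io.StringIO merely stands in for an open file so the snippet runs on its own.

```python
import io
import json

# Hypothetical sample shaped like one cell of the cast column.
raw = '[{"name": "Sam Worthington", "order": 0}, {"name": "Zoe Saldana", "order": 1}]'

# json.loads() parses a JSON string that is already in memory.
from_string = json.loads(raw)

# json.load() reads from a file-like object; StringIO stands in for an open file.
from_file = json.load(io.StringIO(raw))

print(type(from_string))         # <class 'list'>
print(from_string[1]["name"])    # Zoe Saldana
print(from_string == from_file)  # True
```

Since the CSV cells are already strings in memory, json.loads() is the one that applies to this dataset.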

import json
type(json.loads(credits["cast"][0]))
Out[6]:
    list

 

 

As shown above, json.loads() converted the JSON string into a list, and that list in turn contains multiple dicts.
Next, process the columns in bulk.

import json
json_col = ["cast","crew"]
for i in json_col:
    credits[i] = credits[i].apply(json.loads)

>>> credits["cast"][0][:3]

Out[7]:
    [{"cast_id": 242,
      "character": "Jake Sully",
      "credit_id": "5602a8a7c3a3685532001c9a",
      "gender": 2,
      "id": 65731,
      "name": "Sam Worthington",
      "order": 0},
     {"cast_id": 3,
      "character": "Neytiri",
      "credit_id": "52fe48009251416c750ac9cb",
      "gender": 1,
      "id": 8691,
      "name": "Zoe Saldana",
      "order": 1},
     {"cast_id": 25,
      "character": "Dr. Grace Augustine",
      "credit_id": "52fe48009251416c750aca39",
      "gender": 1,
      "id": 10205,
      "name": "Sigourney Weaver",
      "order": 2}]
print("cast type after parsing:", type(credits["cast"][0]))
# The data type is now list, so it can be processed in a loop

Out[8]:
    cast type after parsing: <class "list">

 

Extract the names

credits["cast"][0][:3]
# The cast of the first row of credits is a list

Out[9]:
    [{"cast_id": 242,
      "character": "Jake Sully",
      "credit_id": "5602a8a7c3a3685532001c9a",
      "gender": 2,
      "id": 65731,
      "name": "Sam Worthington",
      "order": 0},
     {"cast_id": 3,
      "character": "Neytiri",
      "credit_id": "52fe48009251416c750ac9cb",
      "gender": 1,
      "id": 8691,
      "name": "Zoe Saldana",
      "order": 1},
     {"cast_id": 25,
      "character": "Dr. Grace Augustine",
      "credit_id": "52fe48009251416c750aca39",
      "gender": 1,
      "id": 10205,
      "name": "Sigourney Weaver",
      "order": 2}]
credits["cast"][0][0]["name"]  # get the name from the first dict of the first row

Out[10]:

    "Sam Worthington"

 

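Commonly used dict methods

dict.get() returns the value for the given key; if the key is not in the dict, it returns the default value (None unless one is supplied).
dict.items() returns an iterable view of (key, value) tuples.

# A quick test: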
i = credits["cast"][0][0]
for x in i.items():
    print(x)

Out[11]:
    ("cast_id", 242)
    ("character", "Jake Sully")
    ("credit_id", "5602a8a7c3a3685532001c9a")
    ("gender", 2)
    ("id", 65731)
    ("name", "Sam Worthington")
    ("order", 0)
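
The test above exercises only dict.items(). As a small complementary sketch, dict.get() behaves as follows on a hypothetical record modeled on one cast entry (the "n/a" default is an arbitrary illustrative choice):

```python
# Hypothetical record modeled on one entry of the cast list.
member = {"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington"}

name = member.get("name")                 # key exists, so its value is returned
job = member.get("job")                   # key missing, so None is returned
job_or_default = member.get("job", "n/a")  # key missing, so the supplied default is returned

# items() yields (key, value) tuples; list() materializes the view.
pairs = list(member.items())

print(name)            # Sam Worthington
print(job)             # None
print(job_or_default)  # n/a
print(pairs[0])        # ('cast_id', 242)
```

Unlike member["job"], which would raise KeyError here, get() never raises for a missing key.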

 

 

Create a get_names() function to break the cast column down further

def get_names(x):
    # x is a list of dicts; join the "name" of each dict with commas
    return ",".join(i["name"] for i in x)

credits["cast"] = credits["cast"].apply(get_names)
credits["cast"][:3]

Out[12]:
    0    Sam Worthington,Zoe Saldana,Sigourney Weaver,S...
    1    Johnny Depp,Orlando Bloom,Keira Knightley,Stel...
    2    Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...
    Name: cast, dtype: object
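
To see the parse-then-join step in isolation, here is a self-contained sketch without pandas. The two sample cells are hypothetical stand-ins for a JSON column; the second one checks what happens with an empty list.

```python
import json

def get_names(records):
    """Join the "name" field of every dict in records with commas."""
    return ",".join(r["name"] for r in records)

# Hypothetical stand-in for two cells of a JSON column; the second is an empty list.
cells = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]',
]

parsed = [json.loads(c) for c in cells]  # str -> list of dicts
names = [get_names(p) for p in parsed]   # list of dicts -> comma-joined str

print(names)  # ['Action,Adventure', '']
```

An empty list yields an empty string rather than an error, so rows without any entries pass through get_names() safely.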

 

 

Extract the director from crew

credits["crew"][0][0]
Out[13]:
    {"credit_id": "52fe48009251416c750aca23",
     "department": "Editing",
     "gender": 0,
     "id": 1721,
     "job": "Editor",
     "name": "Stephen E. Rivkin"}
# Loop over the crew list, find the entry whose job is "Director", then return that name
def director(x):
    for i in x:
        if i["job"] == "Director":
            return i["name"]
    # No director found: the function falls through and returns None

credits["crew"] = credits["crew"].apply(director)
print(credits[["crew"]][:3])
credits.rename(columns = {"crew":"director"},inplace=True)  # rename the column
credits[["director"]][:3]

Out[14]:
    crew
    0   James Cameron
    1  Gore Verbinski
    2      Sam Mendes
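
A slightly more defensive variant could look like the sketch below. It is an illustration rather than the method used above: find_director is a hypothetical name, the sample crew lists are made up, and using .get() assumes some entries might lack a "job" key.

```python
def find_director(crew):
    """Return the name of the first crew member whose job is "Director", else None."""
    for member in crew:
        # .get() avoids a KeyError if an entry has no "job" key.
        if member.get("job") == "Director":
            return member.get("name")
    return None  # explicit: no director listed

# Hypothetical sample crew lists.
crew_with = [
    {"job": "Editor", "name": "Stephen E. Rivkin"},
    {"job": "Director", "name": "James Cameron"},
]
crew_without = [{"job": "Editor", "name": "Stephen E. Rivkin"}]

print(find_director(crew_with))     # James Cameron
print(find_director(crew_without))  # None
print(find_director([]))            # None
```

Only the first match is returned, so a movie with several directors would keep just one name, which is also how the director() function above behaves.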

 

Parse the JSON columns in the movies table

>>> movies.head(1)
Out[15]:
      budget                                             genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800

 

 

As can be seen, the genres, keywords, spoken_languages, production_countries, and production_companies columns need JSON parsing.

# Same approach as for the credits table
json_col = ["genres","keywords","spoken_languages","production_countries","production_companies"]
for i in json_col:
    movies[i] = movies[i].apply(json.loads)
    movies[i] = movies[i].apply(get_names)
>>> movies.head(1) 
Out[16]:
      budget                                    genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  Action,Adventure,Fantasy,Science Fiction  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800

 

 
That completes the data preprocessing.