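Python Mini Exercise: Preprocessing the TMDB Movie Dataset
Load the TMDB dataset and preprocess it
The TMDb movie database. The dataset contains basic information on nearly 11,000 movies released between 1960 and 2016, mainly genre, budget, box-office revenue, cast and crew, runtime, and ratings. It is used here for data-analysis practice.
Reference article: https://blog.csdn.net/moyue1002/article/details/80332186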
python 3.7
pandas 0.23
numpy 1.18
matplotlib 2.2
    import pandas as pd

    credits = pd.read_csv("./tmdb_5000_credits.csv")
    movies = pd.read_csv("./tmdb_5000_movies.csv")
Inspect the general information of each DataFrame
    # Information about the movies table
    movies.head(1)
    Out[3]:
          budget                                             genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
    0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800
Information about the credits table
    # info() prints the summary itself and returns None, which is why "None" appears in the output
    print(credits.info())
    credits.head(1)
    Out[4]:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4803 entries, 0 to 4802
    Data columns (total 4 columns):
    movie_id    4803 non-null int64
    title       4803 non-null object
    cast        4803 non-null object
    crew        4803 non-null object
    dtypes: int64(1), object(3)
    memory usage: 150.2+ KB
    None
       movie_id  ...                                               crew
    0     19995  ...  [{"credit_id": "52fe48009251416c750aca23", "de...
The cast column of the credits table looks odd: each cell holds a large amount of data.
Take a closer look:
    # Look at index 0 of the cast column in credits: it is one long string
    print("cast type:", type(credits["cast"][0]))
    # The type is str, so the nested fields cannot be accessed directly
    Out[5]:
    cast type: <class 'str'>
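Processing JSON-formatted data
The table shows that the cast column actually holds JSON-formatted data, so it should be handled with the json package.
Here the JSON has the form [{}, {}], that is, an array of objects.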
json.loads() converts a JSON-formatted string into a Python object.
json.load() works on files: it reads JSON from an open file object.
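The difference between the two functions can be sketched as follows. This is a minimal illustration, not part of the original walkthrough: the `raw` string is a shortened, hand-written stand-in for one cell of the cast column, and `io.StringIO` is used in place of a real open file so that the example runs without touching the disk.

```python
import io
import json

# Hand-written stand-in for one cell of the cast column (assumption, not real data)
raw = '[{"name": "Sam Worthington", "order": 0}, {"name": "Zoe Saldana", "order": 1}]'

# json.loads(): parse JSON that is already held in a str
from_string = json.loads(raw)

# json.load(): parse JSON from a file-like object; StringIO stands in for an open file
from_file = json.load(io.StringIO(raw))

print(type(from_string))          # the JSON array becomes a Python list
print(from_string == from_file)   # both calls yield the same Python object
```

Because the CSV cells are already strings in memory, json.loads() is the one that applies in this exercise.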
    import json

    type(json.loads(credits["cast"][0]))
    Out[6]:
    list
As shown above, json.loads() turned the JSON string into a list, and that list in turn wraps multiple dicts.
Next, process the columns in bulk:
    import json

    json_col = ["cast", "crew"]
    for i in json_col:
        credits[i] = credits[i].apply(json.loads)

    credits["cast"][0][:3]
    Out[7]:
    [{'cast_id': 242, 'character': 'Jake Sully', 'credit_id': '5602a8a7c3a3685532001c9a', 'gender': 2, 'id': 65731, 'name': 'Sam Worthington', 'order': 0},
     {'cast_id': 3, 'character': 'Neytiri', 'credit_id': '52fe48009251416c750ac9cb', 'gender': 1, 'id': 8691, 'name': 'Zoe Saldana', 'order': 1},
     {'cast_id': 25, 'character': 'Dr. Grace Augustine', 'credit_id': '52fe48009251416c750aca39', 'gender': 1, 'id': 10205, 'name': 'Sigourney Weaver', 'order': 2}]

    print("cast type is now:", type(credits["cast"][0]))
    # The data type is now list, so it can be processed in a loop
    Out[8]:
    cast type is now: <class 'list'>
Extract the names
    credits["cast"][0][:3]  # cast of the first row of credits: a list
    Out[9]:
    [{'cast_id': 242, 'character': 'Jake Sully', 'credit_id': '5602a8a7c3a3685532001c9a', 'gender': 2, 'id': 65731, 'name': 'Sam Worthington', 'order': 0},
     {'cast_id': 3, 'character': 'Neytiri', 'credit_id': '52fe48009251416c750ac9cb', 'gender': 1, 'id': 8691, 'name': 'Zoe Saldana', 'order': 1},
     {'cast_id': 25, 'character': 'Dr. Grace Augustine', 'credit_id': '52fe48009251416c750aca39', 'gender': 1, 'id': 10205, 'name': 'Sigourney Weaver', 'order': 2}]

    credits["cast"][0][0]["name"]  # name from the first dict of the first row
    Out[10]:
    'Sam Worthington'
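Commonly used dict methods
dict.get() returns the value for the given key; if the key is not in the dictionary, it returns the default value (None unless another default is passed).
dict.items() returns an iterable view of the dictionary's (key, value) tuples.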
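A small sketch of both methods follows. The `member` dictionary is a trimmed, hand-written copy of the first cast entry shown earlier, and the "job" key is deliberately absent so that the default behaviour of get() is visible; the string "unknown" is just an illustrative default.

```python
# Trimmed stand-in for one cast entry (only three of its keys)
member = {"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington"}

name = member.get("name")                      # key exists: its value is returned
job = member.get("job")                        # key missing: None is returned
job_or_default = member.get("job", "unknown")  # key missing: explicit default is returned

# items() is a view; wrap it in list() to materialise the (key, value) tuples
pairs = list(member.items())

print(name, job, job_or_default)
print(pairs[0])
```

Unlike member["job"], which would raise KeyError here, get() lets the code carry on when a key is absent.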
    # Test code:
    i = credits["cast"][0][0]
    for x in i.items():
        print(x)
    Out[11]:
    ('cast_id', 242)
    ('character', 'Jake Sully')
    ('credit_id', '5602a8a7c3a3685532001c9a')
    ('gender', 2)
    ('id', 65731)
    ('name', 'Sam Worthington')
    ('order', 0)
Create a get_names() function that reduces each cast list to a comma-separated string of names
    def get_names(x):
        return ",".join(i["name"] for i in x)

    credits["cast"] = credits["cast"].apply(get_names)
    credits["cast"][:3]
    Out[12]:
    0    Sam Worthington,Zoe Saldana,Sigourney Weaver,S...
    1    Johnny Depp,Orlando Bloom,Keira Knightley,Stel...
    2    Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...
    Name: cast, dtype: object
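To see what get_names() does in isolation, here is a minimal sketch on a hand-written list instead of the real DataFrame. The function body is the same as in the walkthrough; `get_names_safe` is a hypothetical variant, not from the original article, that skips entries without a "name" key, and the third entry in `cast` is made up to exercise it.

```python
def get_names(entries):
    """Join the "name" field of every dict, as in the walkthrough."""
    return ",".join(entry["name"] for entry in entries)


def get_names_safe(entries):
    """Hypothetical variant: ignore entries that lack a "name" key."""
    return ",".join(entry["name"] for entry in entries if "name" in entry)


cast = [
    {"name": "Sam Worthington", "order": 0},
    {"name": "Zoe Saldana", "order": 1},
    {"order": 2},  # made-up entry without a name, for illustration only
]

print(get_names(cast[:2]))   # works: both entries have a name
print(get_names_safe(cast))  # works: the nameless entry is skipped
```

Calling get_names(cast) on the full list would raise KeyError at the third entry, which is the case the variant guards against.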
Extract the director from crew
    credits["crew"][0][0]
    Out[13]:
    {'credit_id': '52fe48009251416c750aca23', 'department': 'Editing', 'gender': 0, 'id': 1721, 'job': 'Editor', 'name': 'Stephen E. Rivkin'}

    # Loop over the crew, find the entry whose job is "Director", and return its name
    def director(x):
        for i in x:
            if i["job"] == "Director":
                return i["name"]
        # Falls through and returns None when no entry has job == "Director"

    credits["crew"] = credits["crew"].apply(director)
    print(credits[["crew"]][:3])
    credits.rename(columns={"crew": "director"}, inplace=True)  # rename the column
    credits[["director"]][:3]
    Out[14]:
                 crew
    0   James Cameron
    1  Gore Verbinski
    2      Sam Mendes
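The director() function returns the first matching name and implicitly returns None when a crew list has no "Director" entry. The sketch below shows the same lookup written with next() and an explicit default. `find_director` and both sample crew lists are hypothetical, written only for illustration; the sample names are taken from the output above.

```python
def find_director(crew, default=None):
    """Return the name of the first crew member whose job is "Director"."""
    return next(
        (member["name"] for member in crew if member.get("job") == "Director"),
        default,
    )


# Hand-written samples (assumptions, not rows from the dataset)
crew_with_director = [
    {"job": "Editor", "name": "Stephen E. Rivkin"},
    {"job": "Director", "name": "James Cameron"},
]
crew_without_director = [{"job": "Editor", "name": "Stephen E. Rivkin"}]

print(find_director(crew_with_director))
print(find_director(crew_without_director))
print(find_director(crew_without_director, default="unknown"))
```

Making the default explicit is a design choice: rows without a director then show up as a known placeholder rather than a silent None.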
Parse the JSON columns of the movies table
    movies.head(1)
    Out[15]:
          budget                                             genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
    0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800
This shows that the genres, keywords, spoken_languages, production_countries, and production_companies columns need JSON parsing.
    # Same approach as for the credits table
    json_col = ["genres", "keywords", "spoken_languages", "production_countries", "production_companies"]
    for i in json_col:
        movies[i] = movies[i].apply(json.loads)
        movies[i] = movies[i].apply(get_names)

    movies.head(1)
    Out[16]:
          budget                                    genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
    0  237000000  Action,Adventure,Fantasy,Science Fiction  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800
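The two apply() steps above (parse, then join the names) can also be folded into a single function. The sketch below is one possible way to do that, not the original author's method: `names_from_json` is a hypothetical helper, and the guard for non-string cells assumes that some rows might hold a missing value (NaN) instead of a JSON string, which the walkthrough does not confirm.

```python
import json


def names_from_json(value):
    """Parse a JSON array of objects and join their "name" fields with commas.

    Non-string input (for example NaN from a missing cell) yields an empty string.
    """
    if not isinstance(value, str):
        return ""
    return ",".join(item["name"] for item in json.loads(value))


# Hand-written stand-in for one cell of the genres column
genres_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(names_from_json(genres_cell))
print(repr(names_from_json("[]")))          # empty JSON array gives an empty string
print(repr(names_from_json(float("nan"))))  # missing value gives an empty string
```

With pandas, such a helper could be passed to apply() once per column, for example movies[i].apply(names_from_json), in place of the two separate calls.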