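“When the epidemic is over and I can go home, I will hug Mom and Dad, take them for a walk along the river, listen to them nag, and never talk back again. I love you both, and I hope you know that.”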
01.
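Saving the short comments
Inspecting the page with the browser's “Inspect” tool reveals the URL of the data endpoint. As the page is scrolled down to load more comments, only the “start” parameter in the URL changes, taking the values 0, 20, 40, 60, 80, and so on.
To get past Douban's anti-scraping check, each request must also carry two fields in its request headers: “User-Agent” and “Referer”.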
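The paging pattern and the two headers described above can be sketched without sending any traffic. The snippet below is only an illustration: it uses the standard library's `urllib` instead of `requests`, the helper name `build_page_requests` is hypothetical, and the endpoint, query parameters, and header values are copied from this article's own code (with the stray quote in the User-Agent string removed).

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint and header values copied from this article's own code.
BASE_URL = "https://m.douban.com/rexxar/api/v2/gallery/topic/125573/items"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
    "Referer": "https://www.douban.com/gallery/topic/125573/?from=gallery_trend&sort=hot",
}


def build_page_requests(total=200, page_size=20):
    """Yield one unsent Request per page; only `start` differs between pages."""
    for start in range(0, total, page_size):
        query = urlencode({
            "sort": "new",
            "start": start,
            "count": page_size,
            "status_full_text": 1,
            "guest_only": 0,
            "ck": "null",
        })
        yield Request(f"{BASE_URL}?{query}", headers=HEADERS)


if __name__ == "__main__":
    # Show the first three page URLs (start=0, 20, 40) without contacting the site.
    for req in build_page_requests(total=60):
        print(req.full_url)
```

Building the requests first makes it easy to confirm that `start` advances in steps of 20 before any request is actually sent.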
<pre style="line-height: 18px;font-size: 14px;letter-spacing: 0px;font-family: Consolas, Inconsolata, Courier, monospace;border-radius: 0px;color: rgb(169, 183, 198);background: rgb(40, 43, 46);padding: 0.5em;margin-left: 16px;margin-right: 16px;overflow-wrap: normal !important;word-break: normal !important;overflow: auto !important;display: -webkit-box !important;">
import requests

# Both headers are required to get past the anti-scraping check
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'Referer': 'https://www.douban.com/gallery/topic/125573/?from=gallery_trend&sort=hot',
}

for i in range(0, 200, 20):
    # Data endpoint found with the browser's Inspect tool; only "start" changes per page
    url = ('https://m.douban.com/rexxar/api/v2/gallery/topic/125573/items?'
           'sort=new&start={}&count=20&status_full_text=1&guest_only=0&ck=null').format(i)

    # Send the request and decode the JSON response
    response = requests.get(url, headers=headers)
    data = response.json()

    # Extract the short comment of every item on the page and append it to a local file
    with open("want_after.txt", "a", encoding='utf-8') as f:
        for item in data['items']:
            abstract = item['abstract']
            f.write(abstract + '\n')
            print(abstract)
</pre>
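The code above reads the `abstract` field of each entry under `items`. A page may hold fewer entries than expected, or entries with no abstract, so a slightly more defensive extraction step might look like the sketch below. The function name `extract_abstracts` and the sample payload are made up for illustration; only the `items` / `abstract` shape comes from the article.

```python
def extract_abstracts(payload):
    """Return the non-empty 'abstract' strings from one page of the JSON response."""
    abstracts = []
    for item in payload.get("items", []):
        # Missing or null abstracts are treated as empty and skipped.
        text = (item.get("abstract") or "").strip()
        if text:
            abstracts.append(text)
    return abstracts


# Made-up payload that only mimics the shape used above: {'items': [{'abstract': ...}]}
sample = {
    "items": [
        {"abstract": "吃火锅"},
        {"abstract": ""},
        {"id": 1},
        {"abstract": " 抱抱爸妈 "},
    ]
}
print(extract_abstracts(sample))
```

Iterating over whatever `items` contains, rather than a fixed index range, avoids both skipped entries and index errors on a short final page.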
02.
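Word cloud visualization
Once the data is saved, it is segmented into words with “jieba”; the segmented text is then drawn as a word cloud with “wordcloud” to visualize the data.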
<pre style="line-height: 18px;font-size: 14px;letter-spacing: 0px;font-family: Consolas, Inconsolata, Courier, monospace;border-radius: 0px;color: rgb(169, 183, 198);background: rgb(40, 43, 46);padding: 0.5em;margin-left: 16px;margin-right: 16px;overflow-wrap: normal !important;word-break: normal !important;overflow: auto !important;display: -webkit-box !important;">
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd  # used later for the word-frequency table
import jieba

# Build the space-separated text that wordcloud expects
with open("want_after.txt", "r", encoding='utf-8') as f:
    text = ' '.join(jieba.cut(f.read(), cut_all=False))
    # print(text)

background_image = plt.imread('豆瓣.jpg')  # image used as the mask for the cloud

# Word cloud settings
wc = WordCloud(
    background_color='white',
    mask=background_image,
    font_path='SourceHanSerifCN-Medium.otf',  # a font that contains Chinese glyphs
    max_words=200,
    max_font_size=200,
    min_font_size=8,
    random_state=50,
)

# Generate the word cloud
word_cloud = wc.generate_from_text(text)

plt.imshow(word_cloud)
plt.axis('off')
plt.show()

wc.to_file('结果.jpg')
</pre>
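The segmentation step above passes every token straight to the word cloud. One optional refinement, not part of the original code, is to drop filler words and punctuation before joining the tokens. The sketch below assumes a plain list of tokens such as `jieba.cut` would produce; the helper `filter_tokens`, the sample tokens, and the stopword list are all hypothetical.

```python
def filter_tokens(tokens, stopwords=frozenset(), min_length=2):
    """Keep tokens that are long enough and are not in the stopword set."""
    kept = []
    for token in tokens:
        token = token.strip()  # whitespace-only tokens become empty and are dropped
        if len(token) >= min_length and token not in stopwords:
            kept.append(token)
    return kept


# Hypothetical token list standing in for the output of jieba.cut(...)
tokens = ["等", "疫情", "过去", ",", "我", "想", "吃火锅", "\n", "然后", "看", "电影"]
stopwords = {"然后", "过去"}  # illustrative stopword list, not from the article

text = " ".join(filter_tokens(tokens, stopwords))
print(text)
```

The resulting space-joined string can be passed to `generate_from_text` in the same way as the unfiltered text.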
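The word cloud makes the high-frequency keywords easy to spot: “吃火锅” (eating hot pot), “电影” (movies), “朋友” (friends), “奶茶” (milk tea), “拥抱” (hugs), “疫情” (the epidemic), and others.
They also speak for what most of us are wishing for.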
03.
High-frequency word statistics
<pre style="line-height: 18px;font-size: 14px;letter-spacing: 0px;font-family: Consolas, Inconsolata, Courier, monospace;border-radius: 0px;color: rgb(169, 183, 198);background: rgb(40, 43, 46);padding: 0.5em;margin-left: 16px;margin-right: 16px;overflow-wrap: normal !important;word-break: normal !important;overflow: auto !important;display: -webkit-box !important;">
# See which words occur most often
process_word = wc.process_text(text)
sorted_words = sorted(process_word.items(), key=lambda e: e[1], reverse=True)
sort_after = sorted_words[:50]
print(sort_after)

# Save the data as a CSV file
df = pd.DataFrame(sort_after, columns=['word', 'count'])
# utf_8_sig writes a BOM so the Chinese text is not garbled when opened in Excel
df.to_csv('sort_after.csv', encoding='utf_8_sig')
</pre>
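For readers without pandas, the same top-N ranking can be approximated with the standard library alone. This is a swapped-in technique rather than the article's method: it counts whitespace-separated tokens with `collections.Counter`, so the numbers may differ from what `process_text` reports. The helper names and the sample text are made up for illustration.

```python
import csv
import io
from collections import Counter


def top_words(text, n=50):
    """Count whitespace-separated tokens and return the n most common as (word, count)."""
    return Counter(text.split()).most_common(n)


def to_csv_text(rows):
    """Render (word, count) rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["word", "count"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up segmented text in the same space-joined form as `text` above
sample_text = "火锅 电影 火锅 朋友 奶茶 火锅 电影"
rows = top_words(sample_text, n=2)
print(rows)
print(to_csv_text(rows))
```

To write the result to disk, the same rows can be passed to `csv.writer` on a file opened with `encoding='utf-8-sig'` and `newline=''`, which keeps the byte-order mark used in the pandas version.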
Facing the sea, with spring flowers in bloom.
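This article comes from: 菜鸟学Python
This is an original article; copyright belongs to 知行编程网. You are welcome to share it; please credit the source when reposting.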