m5k1umn posted on 2024-10-26 16:48:53

Python Web Scraping in Practice: Scraping NetEase Cloud Music


    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Preface</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Lately a lot of music on NetEase has become unavailable. I happened to come across quite a few tutorials, followed along to learn, and pulled them together here. I originally wanted to optimize the approach, but the problems turned out to be rather complicated, so in the end I took a different route and offer a simple method instead.</p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Python + Web Scraping</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">First, the prerequisites:</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Python: a basic grasp of Python syntax.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">requests: a library dedicated to making HTTP requests (see the Chinese edition of the requests documentation).</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">lxml: Python's built-in regular expression module re would also work, but to keep things beginner-friendly we use etree from lxml to locate the data in the page.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">re: Python's regular expression module.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">json: Python's JSON library.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Next, the download link is already known to look like this:</p>
    <pre>http://music.163.com/song/media/outer/url?id=</pre>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Here id is the song's ID!</p>
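    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As a minimal sketch of how that link is assembled: the helper below simply appends a numeric song ID to the outer-link prefix quoted above. The function name build_download_url and the numeric check are my own assumptions for illustration, not something the endpoint requires.</p>

```python
# Outer-link prefix taken from the article.
OUTER_URL = "http://music.163.com/song/media/outer/url?id="


def build_download_url(song_id):
    """Return the outer-link URL for a song ID (int or digit string).

    Hypothetical helper: rejects non-numeric IDs so that a malformed
    value is caught before any request is made.
    """
    song_id = str(song_id).strip()
    if not song_id.isdigit():
        raise ValueError("song ID must be numeric, got {!r}".format(song_id))
    return OUTER_URL + song_id


print(build_download_url(254485))
```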
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">So the crawler's main job is to find this ID. To save the files sensibly, we also need the song title.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Next we need the URLs of the pages to scrape. After some analysis, there are roughly three kinds:</p>
    <pre>
# Playlist
music_list = 'https://music.163.com/#/playlist?id=2412826586'
# Artist page (top songs)
artist_list = 'https://music.163.com/#/artist?id=8325'
# Search results
search_list = 'https://music.163.com/#/search/m/?order=hot&amp;cat=全部&amp;limit=435&amp;offset=435&amp;s=梁静茹'
</pre>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">If you only want to download a single song, for example 静茹 - 勇气: https://music.163.com/#/song?id=254485, just open http://music.163.com/song/media/outer/url?id=254485 in your browser. No crawler needed!</p>
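    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Going from the song page URL to the download URL only requires reading the id parameter. A rough sketch with the standard library follows; note that in the address-bar form the query sits after the #, so it ends up in the URL fragment. The function name extract_id is hypothetical.</p>

```python
from urllib.parse import parse_qs, urlparse


def extract_id(page_url):
    """Return the value of the `id` parameter, or None if it is absent.

    Handles both the address-bar form (".../#/song?id=...") where the
    query is part of the fragment, and the plain form (".../song?id=...").
    """
    parsed = urlparse(page_url)
    query = parsed.query
    if not query and "?" in parsed.fragment:
        # "#/song?id=254485" -> "id=254485"
        query = parsed.fragment.split("?", 1)[1]
    ids = parse_qs(query).get("id")
    return ids[0] if ids else None


song_id = extract_id("https://music.163.com/#/song?id=254485")
print("http://music.163.com/song/media/outer/url?id=" + song_id)
```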
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Downloading Lyrics</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Downloading the lyrics as well is just as simple. With the song ID, one API call is enough:</p>
    <pre>url = 'http://music.163.com/api/song/lyric?id={}&amp;lv=-1&amp;kv=-1&amp;tv=-1'.format(song_id)</pre>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The returned JSON looks roughly like this:</p>
    <pre>
{
  "sgc": true,
  "sfy": false,
  "qfy": false,
  "lrc": {
    "version": 7,
    "lyric": "开了窗 等待天亮\n看这城市 悄悄的 熄了光\n听风的方向\n这一刻 是不是和我同样\n孤单的飞翔\n模糊了眼眶\n广播里 那首歌曲\n重复当时 那条街那个你\n相同的桌椅\n不消言语 就会有默契\n这份亲密\n那样熟练\n在爱里 等着你\n被你疼惜 有种暖意\n在梦里 全是你\n不要再迟疑 把我抱紧"
  },
  "klyric": {
    "version": 0,
    "lyric": null
  },
  "tlyric": {
    "version": 0,
    "lyric": null
  },
  "code": 200
}
</pre>
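    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">To show how such a response might be consumed without touching the network, here is a small sketch that parses a payload of the same shape. The sample payload is made up for illustration and assumes the lyric text carries bracketed LRC timestamps such as [00:01.00], which is what the regex in the full code further down strips. The null check is my own addition, since the response above shows that lyric fields can be null.</p>

```python
import json
import re

# Made-up payload with the same shape as the API response shown above.
SAMPLE = (
    '{"lrc": {"version": 7, '
    '"lyric": "[00:01.00]first line\\n[00:05.50]second line"}, '
    '"tlyric": {"version": 0, "lyric": null}, "code": 200}'
)

# Matches one bracketed tag such as "[00:01.00]".
TIMESTAMP = re.compile(r"\[[^\]]*\]")


def extract_lyric(payload):
    """Return the plain lyric text, or "" when the song has no lyric."""
    data = json.loads(payload)
    lyric = (data.get("lrc") or {}).get("lyric")
    if not lyric:
        return ""
    return TIMESTAMP.sub("", lyric).strip()


print(extract_lyric(SAMPLE))
```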
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Pitfalls and Going Further</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">It looks simple on the surface, but note that the pages NetEase returns load their data dynamically with JavaScript. The HTML a crawler receives therefore differs in content and structure from the DOM the browser ends up with!</p>
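    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">One part of this mismatch can be shown offline. Everything after # in a URL is a fragment, which HTTP clients do not send to the server, so requesting the address-bar URL as-is only asks for the site root. The full code in this post removes the "/#" before requesting. The sketch below illustrates that idea; the helper names are hypothetical, and whether the resulting page contains the data you need still has to be checked per page type.</p>

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path-and-query part that an HTTP client actually sends."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def to_static_url(browser_url):
    """Drop the "/#" marker so the path after it reaches the server."""
    return browser_url.replace("/#/", "/", 1)


browser_url = "https://music.163.com/#/playlist?id=2412826586"
print(request_target(browser_url))                 # /
print(request_target(to_static_url(browser_url)))  # /playlist?id=2412826586
```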
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">The Pitfall</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In particular, the content a crawler gets back for the search results page contains no song IDs at all!</p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Solutions</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">There are ways around this.</p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Simulating a Browser with Python</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Use selenium together with the headless browser phantomjs. The combination drives a browser directly, so it can retrieve the page data after JavaScript has rendered it.</p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Drawbacks</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Because every page has to be loaded in a headless browser, this approach is very slow and is not recommended for bulk scraping.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">It is meant for data that is loaded by asynchronous requests, is absent from the page source, and therefore cannot be scraped with a plain request.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Turning the searched songs into a playlist</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">For example, to download all of a given artist's songs, search for the artist in the mobile Cloud Music app, save all of the results into a new playlist, and you're set!</p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Summary</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">With Python, things should be kept simple. I think complicated things are best avoided where possible, and a shortcut is worth taking when there is one, so this article does not put selenium+phantomjs into practice.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Note: this article is for technical exchange only. Please do not use it for commercial purposes. I take no responsibility for any violations.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Full code</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Once again, it's done in about 100 very simple lines of code!</p>
    <pre>
import json
import os
import re

import requests
from lxml import etree

# NetEase Cloud Music outer-link endpoint: http://music.163.com/song/media/outer/url?id=xxxx
OUT_LINK = 'http://music.163.com/song/media/outer/url?id='

# Request headers used for the page and lyric requests
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Referer': 'https://music.163.com/',
    'Host': 'music.163.com',
    # 'Origin': 'https://music.163.com',
}


def download_songs(url=None):
    if url is None:
        url = 'https://music.163.com/#/playlist?id=2384642500'

    # Remove the "/#" from the address-bar URL and switch the scheme to http
    url = url.replace('/#', '').replace('https', 'http')

    # Fetch the page source
    res = requests.get(url=url, headers=HEADERS).text
    tree = etree.HTML(res)

    # Song links
    song_list = tree.xpath('//ul[@class="f-hide"]/li/a')

    # If this is an artist page: the artist's name
    artist_name_tree = tree.xpath('//h2[@id="artist-name"]/text()')
    artist_name = str(artist_name_tree[0]) if artist_name_tree else None

    # If this is a playlist page: the playlist's name
    # song_list_tree = tree.xpath('//*[@id="m-playlist"]/div/div/div/div/div/div/div/table/tbody')
    song_list_name_tree = tree.xpath('//h2/text()')
    song_list_name = str(song_list_name_tree[0]) if song_list_name_tree else None

    # Save into a folder named after the artist or the playlist;
    # fall back to "downloads" when neither name was found
    folder = './' + (artist_name or song_list_name or 'downloads')
    os.makedirs(folder, exist_ok=True)

    for i, s in enumerate(song_list):
        href = str(s.xpath('./@href')[0])
        song_id = href.split('=')[-1]
        src = OUT_LINK + song_id  # the real download URL of the song
        title = str(s.xpath('./text()')[0])  # the song title
        filename = title.replace('/', '_') + '.mp3'  # "/" is not valid in a file name
        filepath = os.path.join(folder, filename)
        print('Downloading song {}: {}\n'.format(i + 1, filename))

        try:
            # Download the lyrics
            # download_lyric(title, song_id)

            data = requests.get(src).content  # the song's binary data
            with open(filepath, 'wb') as f:
                f.write(data)
        except Exception as e:
            print(e)

    print('All {} songs have been downloaded!'.format(len(song_list)))


def download_lyric(song_name, song_id):
    url = 'http://music.163.com/api/song/lyric?id={}&amp;lv=-1&amp;kv=-1&amp;tv=-1'.format(song_id)
    res = requests.get(url=url, headers=HEADERS).text
    json_obj = json.loads(res)
    lyric = json_obj['lrc']['lyric']
    # Strip the bracketed tags (non-greedy, so text between two tags survives)
    reg = re.compile(r'\[.*?\]')
    lrc_text = re.sub(reg, '', lyric).strip()

    print(song_name, lrc_text)


if __name__ == '__main__':
    # music_list = 'https://music.163.com/#/playlist?id=2384642500'  # playlist
    music_list = 'https://music.163.com/#/artist?id=8325'  # artist page (top songs)
    # music_list = 'https://music.163.com/#/search/m/?order=hot&amp;cat=全部&amp;limit=435&amp;offset=435&amp;s=梁静茹'  # search results
    download_songs(music_list)
</pre>
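    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">To try the ID and title extraction without the network or lxml, here is a standard-library sketch of the same idea as the //ul[@class="f-hide"]/li/a XPath. It swaps lxml for html.parser, and the sample HTML, song IDs and titles are invented for illustration. The page structure is assumed from that XPath rather than verified against the live site.</p>

```python
from html.parser import HTMLParser


class SongListParser(HTMLParser):
    """Collect (song_id, title) pairs from links inside <ul class="f-hide">."""

    def __init__(self):
        super().__init__()
        self.in_list = False      # True while inside the hidden song list
        self.current_href = None  # href of the <a> currently being read
        self.songs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and "f-hide" in (attrs.get("class") or "").split():
            self.in_list = True
        elif tag == "a" and self.in_list:
            self.current_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False
        elif tag == "a":
            self.current_href = None

    def handle_data(self, data):
        if self.current_href and data.strip():
            # Same rule as the full code: the ID is whatever follows the last "="
            song_id = self.current_href.split("=")[-1]
            self.songs.append((song_id, data.strip()))


def parse_songs(html):
    parser = SongListParser()
    parser.feed(html)
    return parser.songs


# Invented sample markup with hypothetical IDs and titles.
SAMPLE_HTML = (
    '<ul class="f-hide">'
    '<li><a href="/song?id=1001">Song A</a></li>'
    '<li><a href="/song?id=1002">Song B</a></li>'
    '</ul>'
    '<a href="/about?id=9">not a song</a>'
)

print(parse_songs(SAMPLE_HTML))
```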
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">If you have any questions, feel free to discuss them in the comments!</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">If anything here is incorrect, corrections are welcome!</p>



