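Python Scrapy: crawling the content and images of all answers under Zhihu questions and collections

The previous post walked through crawling Zhihu question metadata. This one covers crawling the content and images of every answer under a question. The overall process is the same; only some of the core code differs.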

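The flow for crawling everything under one question is roughly:

Start from a question URL.
Request the URL and read the number of answers under the question (I skip this step, because the answer count was already saved when the question metadata was crawled).
Fetch the answers through the answer API (if each call returns 5 answers and there are 100 answers in total, the API has to be called 20 times). [The answer API address is shown in the figure below.]
Save the content returned by the answer API to MySQL.
Extract the image URLs from the content and save the images locally.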
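
The request count in step 3 is a ceiling division. Below is a minimal sketch of that arithmetic; the function name `answer_offsets` and the `page_size` parameter are made up for illustration and are not part of the original spider.

```python
def answer_offsets(answer_num, page_size=5):
    """Return the offset of every API call needed to cover answer_num answers."""
    # Integer ceiling division: 100 answers at 5 per call -> 20 calls,
    # 101 answers -> 21 calls, 0 answers -> 0 calls.
    page_count = (answer_num + page_size - 1) // page_size
    return [page * page_size for page in range(page_count)]


if __name__ == "__main__":
    offsets = answer_offsets(100)
    print(len(offsets), offsets[0], offsets[-1])
```

For 100 answers this yields 20 offsets, running from 0 to 95, which matches the count given in the steps above.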

Crawl code:

Look up the question id in the MySQL database, then call the answer API directly to fetch the data.

    answer_template="https://www.zhihu.com/api/v4/questions/%s/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,badge[?(type=best_answerer)].topics&limit=5&offset=%s&sort_by=default"

    # Module-level imports this method relies on: import MySQLdb, import scrapy
    def check_login(self, response):
        # Read the question info from MySQL to decide what to crawl
        db = MySQLdb.connect("localhost", "root", "", "crawl", charset='utf8')
        cursor = db.cursor()
        selectsql = "select questionid, answer_num from zhihu_question where id in (251,138,93,233,96,293,47,24,288,151,120,311,214,33);"
        try:
            cursor.execute(selectsql)
            results = cursor.fetchall()
            for row in results:
                questionid = row[0]
                answer_num = row[1]
                # Number of calls to the answer API: answer_num / 5, rounded up to an integer
                fornum = (answer_num + 4) // 5
                print("questionid : " + str(questionid) + "   answer_Num: " + str(answer_num))
                for i in range(fornum):
                    # Each call returns 5 answers, so the offset advances in steps of 5
                    answer_url = self.answer_template % (str(questionid), str(i * 5))
                    yield scrapy.Request(answer_url, callback=self.parse_answer, headers=self.headers)
        except Exception as e:
            print(e)
        finally:
            # Close the connection even if the query or the loop fails
            db.close()
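
To make the loop above easier to follow, here is a small standalone sketch of the URLs it produces for one database row. The template is a shortened stand-in (the real `answer_template` carries a much longer `include=` list), and `answer_urls`, `PAGE_SIZE`, and the question id used below are made up for illustration.

```python
PAGE_SIZE = 5  # answers returned per API call, as in the article

# Shortened stand-in for the article's answer_template (assumption:
# only three include fields are kept here to keep the example readable).
ANSWER_TEMPLATE = (
    "https://www.zhihu.com/api/v4/questions/%s/answers"
    "?include=data[*].content,voteup_count,comment_count"
    "&limit=%s&offset=%s&sort_by=default"
)


def answer_urls(question_id, answer_num):
    """Build one answer API URL per page for a single question."""
    page_count = (answer_num + PAGE_SIZE - 1) // PAGE_SIZE  # round up
    return [
        ANSWER_TEMPLATE % (question_id, PAGE_SIZE, page * PAGE_SIZE)
        for page in range(page_count)
    ]


if __name__ == "__main__":
    # 12345678 is a placeholder question id, not a real one from the article.
    for url in answer_urls(12345678, 12):
        print(url)
```

In the spider, each of these URLs would be wrapped in a `scrapy.Request` with `parse_answer` as the callback.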

Parsing the response

parse_answer parses the content returned by the API. This part is fairly easy, because the response is JSON.
The code is as follows:

    # Module-level imports this method relies on: import json, from lxml import etree
    def parse_answer(self, response):
        # During testing, write the response to a local file, then test it from a
        # Python main method; the test code all lives under the test_code directory
        #temfn= str(random.randint(0,100))
        #f = open("/var/www/html/scrapy/answer/"+temfn,'wb')
        #f.write(response.body)
        #f.write("------")
        #f.close()
        res = json.loads(response.text)
        #print (res)
        data = res['data']
        # One call returns several answers (5 by default), so iterate over them
        for od in data:
            #print(od)
            item = AnswerItem()
            item['answer_id'] = str(od['id'])  # answer id
            item['question_id'] = str(od['question']['id'])
            item['question_title'] = od['question']['title']
            item['author_url_token'] = od['author']['url_token']
            item['author_name'] = od['author']['name']
            item['voteup_count'] = str(od['voteup_count'])
            item['comment_count'] = str(od["comment_count"])
            item['content'] = od['content']
            yield item
            # Parse the answer HTML and collect the original image URLs
            testh = etree.HTML(od['content'])
            itemimg = MyImageItem()
            # "<question id>/<answer id>" identifies which answer the images belong to
            itemimg['question_answer_id'] = str(od['question']['id']) + "/" + str(od['id'])
            itemimg['image_urls'] = testh.xpath("//img/@data-original")
            yield itemimg
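
The parsing logic can be tried outside Scrapy. The sketch below mirrors `parse_answer` on a hand-written sample payload, under a few assumptions: the payload is invented and only contains the fields used here, plain dicts stand in for `AnswerItem` and `MyImageItem`, and the standard library's `html.parser` replaces lxml so that the example has no third-party dependencies.

```python
import json
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect the data-original attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "data-original" and value:
                self.urls.append(value)


def extract_image_urls(content_html):
    """Return the data-original URLs found in an answer's HTML content."""
    collector = ImageUrlCollector()
    collector.feed(content_html or "")
    collector.close()
    return collector.urls


def parse_answers(body_text):
    """Yield (answer, images) dict pairs for each answer in the JSON body."""
    for od in json.loads(body_text).get("data", []):
        answer = {
            "answer_id": str(od["id"]),
            "question_id": str(od["question"]["id"]),
            "voteup_count": str(od["voteup_count"]),
            "content": od["content"],
        }
        images = {
            "question_answer_id": answer["question_id"] + "/" + answer["answer_id"],
            "image_urls": extract_image_urls(od["content"]),
        }
        yield answer, images


# Invented sample payload; a real API response carries many more fields.
SAMPLE = json.dumps({
    "data": [{
        "id": 2,
        "question": {"id": 1, "title": "sample question"},
        "voteup_count": 3,
        "content": '<p>hi</p><img src="a.jpg" '
                   'data-original="https://example.com/a_r.jpg">',
    }]
})

if __name__ == "__main__":
    for answer, images in parse_answers(SAMPLE):
        print(answer["answer_id"], images)
```

Images without a `data-original` attribute are skipped, which matches the XPath `//img/@data-original` used in the spider.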

Results

The crawl collected more than 40,000 answers and 12 GB of images (my personal server only had 12 GB of space left).

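Crawling answer content and images under a collection:

The flow for crawling the answers in a collection is basically the same as for the answers under a question. The parts that differ are constructing the start URL of each page and the core code that parses the HTML.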
