Dust8's Blog

Read a book a hundred times and its meaning reveals itself


Preface

Yesterday a boss gave me a crawling exercise: scrape the comments on NetEase News. I said I'd give it a try.

Analysis

First-level entry

The NetEase News entry point: news.163.com

Second-level entries:


Clicking into an article and viewing the page source shows no content, so the page is loaded dynamically.
The XHR tab under Network shows no useful request, but looking through JS turns up a suspicious one.

That's it: http://temp.163.com/special/00804KVA/cm_guonei.js?callback=data_callback
It returns the latest domestic (guonei) news.

Note that cm_guonei here and the domestic channel we entered from do not use the same name.
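Before writing a spider, the endpoint can be sanity-checked in a few lines. This is just a sketch, assuming requests is available; the URL, the data_callback wrapper, and the gb18030 encoding match what the spider below uses.

import re
import json
import requests

url = 'http://temp.163.com/special/00804KVA/cm_guonei.js?callback=data_callback'
text = requests.get(url).content.decode('gb18030')
# Strip the JSONP wrapper data_callback(...) and parse the JSON inside.
m = re.match(r'data_callback\((.*)\)', text, re.DOTALL)
newses = json.loads(m.group(1))
print(len(newses), newses[0]['title'])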

Comment entry

The same approach turns up the comment endpoints:

  • hottest comments: hotList
  • newest comments: newList

Note that the comment API can differ between second-level channels. For example, domestic news comments come from http://comment.news.163.com, while military news comments come from http://comment.war.163.com (a small helper sketch follows).
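Since the host varies per channel, it may help to keep the mapping in one place. A minimal sketch with a hypothetical helper; only the two hosts above come from the analysis, and it assumes (unverified) that other channels share the same URL path and product key as the domestic one.

# Hypothetical mapping: only these two hosts are confirmed above.
COMMENT_HOSTS = {
    'guonei': 'http://comment.news.163.com',
    'war': 'http://comment.war.163.com',
}

def comment_list_url(channel, threadid):
    # Assumes the path and product key observed for domestic news also apply elsewhere.
    host = COMMENT_HOSTS.get(channel, 'http://comment.news.163.com')
    return (host + '/api/v1/products/a2869674571f77b5a0867c3d71db5856/'
            'threads/{}/comments/newList?offset=0&limit=30'.format(threadid))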

Code

Writing this with scrapy is straightforward: the usual pattern of two chained parse callbacks is enough.

import re
import json
import scrapy

from ..items import NewsItem, CommentItem


class NewsSpider(scrapy.Spider):
    name = 'news'
    CM_API_BASE_URL = 'http://temp.163.com/special/00804KVA/cm_{}.js?callback=data_callback'
    NEW_LIST_API_BASE_URL = 'http://comment.news.163.com/api/v1/products/a2869674571f77b5a0867c3d71db5856/threads/{}/comments/newList?offset=0&limit=30&showLevelThreshold=72&headLimit=1&tailLimit=2&callback=getData&ibc=newspc'
    pingdao = ['shehui', 'guoji', 'war']
    start_urls = [CM_API_BASE_URL.format('guonei')]

    def parse(self, response):
        # The list endpoint returns JSONP: data_callback(<json>), so strip the wrapper.
        cm_re = re.compile(r'data_callback\((.*)\)', re.DOTALL)
        m = cm_re.match(response.body.decode('gb18030'))
        if m:
            newses = json.loads(m.group(1))
            for news in newses:
                item = NewsItem()
                item['channelname'] = news['channelname']
                item['docurl'] = news['docurl']
                item['time'] = news['time']
                item['title'] = news['title']
                # The comment thread id is embedded in the article URL.
                item['threadid'] = news['docurl'][-21:-5]
                yield item

                request = scrapy.Request(
                    self.NEW_LIST_API_BASE_URL.format(item['threadid']),
                    callback=self.parse_comment)
                request.meta['threadid'] = item['threadid']
                yield request

    def parse_comment(self, response):
        threadid = response.meta['threadid']
        # The comment endpoint is also JSONP: getData(<json>).
        new_list_re = re.compile(r'getData\((.*)\)', re.DOTALL)
        m = new_list_re.match(response.body.decode())
        if m:
            comments = json.loads(m.group(1))['comments']
            for _, value in comments.items():
                item = CommentItem()
                item['commentid'] = value['commentId']
                item['content'] = value['content']
                item['createtime'] = value['createTime']
                item['threadid'] = threadid
                item['nickname'] = value['user'].get('nickname', 'anonymous')
                item['userid'] = value['user'].get('userId', '0')
                yield item

Results

I crawled domestic news only once: 70 news items and 2012 comments in total.

Things to improve

  • Since only recent news is crawled, most articles have few comments yet; the comment crawl needs to be re-run later to pick up more
  • Older news is not crawled at all

Preface

I want to switch to a programming job. I've been teaching myself Python in my spare time and have played with quite a few things, but mastered very little, so it seems I should pick one area and learn it well. I sent out plenty of resumes online; half were never even opened, most of the rest replied "not a good fit", and only yesterday did one company offer an interview. My grasp isn't solid, I don't memorize much, and I'm used to relying on autocomplete, which was a bit embarrassing. I also got burned by AMap: it said the trip would take an hour and a half, but I spent two and a half hours on the bus alone.

Interview questions

How do you check whether Python is installed, and how do you check its version number?

This one was a giveaway.

which python
python --version

Given a string like 'xxx xx xxxx', extract the numbers and sort them from largest to smallest.

line = 'dust dust8  1234 45 90'

def filter_nums(line):
    str_nums = line.split()
    nums = [int(num) for num in str_nums if num.isnumeric()]
    nums.sort(reverse=True)
    return nums

filter_nums(line)

Here I couldn't produce isnumeric; I only knew there was some function for checking whether a string is a number. I also missed the int conversion, which would have made the sort incorrect. In hindsight a regular expression would have been better (a sketch follows).
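For comparison, the regex variant I had in mind, as a sketch; the word boundaries keep the 8 in dust8 from being picked up, matching the split-based version above.

import re

def filter_nums_re(line):
    # \b\d+\b: only standalone runs of digits, converted to int before sorting.
    nums = [int(num) for num in re.findall(r'\b\d+\b', line)]
    return sorted(nums, reverse=True)

filter_nums_re('dust dust8  1234 45 90')  # [1234, 90, 45]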

Next was a Django database query with two filter conditions.

I hadn't really studied Django. I was on an IV drip in the morning, reviewed basic Python in the afternoon, and only read a few chapters of Django in the evening, so I couldn't write it. Looking it up afterwards, there are three ways (a sketch with a hypothetical model follows the list):

  • filter(quantity__gt=5, price__lt=10)
  • filter(quantity__gt=5).filter(price__lt=10)
  • filter(Q(quantity__gt=5)&Q(price__lt=10))
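To make the three variants concrete, here is a minimal sketch with a hypothetical model; the field names quantity and price come from the answers above, but the model itself is made up.

from django.db import models
from django.db.models import Q


class Product(models.Model):  # hypothetical model, not from the interview
    quantity = models.IntegerField()
    price = models.DecimalField(max_digits=8, decimal_places=2)


# All three return objects with quantity > 5 and price < 10:
Product.objects.filter(quantity__gt=5, price__lt=10)
Product.objects.filter(quantity__gt=5).filter(price__lt=10)
Product.objects.filter(Q(quantity__gt=5) & Q(price__lt=10))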

Write a table using Django's built-in template tags.

I hadn't looked at which built-in tags Django has, so I just wrote it in HTML:

<table>
<tr><th rowspan="2">订单</th><th colspan="3">流程</th></tr>
<tr><th>A</th><th>B</th><th>C</th></tr>
</table>

I knew there was an attribute to control how many cells an element spans, but I didn't get it right either; it's rowspan / colspan.
As for Django's built-in tags, all I wrote was:

{% for x in xx %}
<tr><td>{{ x.x }}</td><td>{{ x.y }}</td></tr>
{% endfor %}

Actually for is one of Django's built-in tags; unfortunately I was a bit nervous and I think I even forgot to wrap the variables in double curly braces.

There was also an open-ended question: how would you measure the number of vehicles passing on a road?

I didn't answer it very well. I wrote down beam-sensor counting, human counting, and image recognition, and that you also have to weigh their accuracy and cost.

Summary

I don't know whether these questions were written specially for me; they felt quite simple. The problem is that I usually tinker for fun without committing things to memory; I only remember the overall flow of how something can be done and figure I can check the docs when actually doing it. Clearly I need to memorize more.

My education isn't high, I'm a bit older, I have no experience, and I keep chasing new technologies, so I know a little about everything but have to review before doing anything, without deep expertise in any one area. I don't know who will take me on, but I should be able to get up to speed quickly.

Preface

Recently I took a look at OpenResty, a bundle of Nginx and Lua that makes it easy to build dynamic web applications, web services and dynamic gateways that handle very high concurrency and scale extremely well; it can support web systems handling 10K or even 1000K+ concurrent connections on a single machine. Searching for video material on it, I found one course on NetEase Cloud Classroom with only a third of the content, and another on stuq with all of it. So I wanted to download the videos.

Analysis

Finding the video URL

The Elements panel in Chrome DevTools shows the video URL is not exposed directly; playback goes through Flash, and the video id is carried as vid.
The Network panel reveals a few useful requests: getvideofile, player.swf and playinfo.
The real video URL is inside the playinfo response. Analyzing that request's query parameters
shows only two are required: vid (the video id) and m, which is fixed. At the time I even decompiled player.swf to see what the other parameters were. In fact the endpoint isn't protected much; simply removing parameters one by one is enough to work this out.
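The endpoint can be checked directly before automating anything. A quick sketch, assuming (as the analysis above found) that only vid and the fixed m are needed and that the response is XML whose <copy> nodes carry a playurl attribute; the vid value here is a placeholder.

import requests
import xml.etree.ElementTree as ET

vid = 'XXXXXXXXXXXXXXXX'  # placeholder: a real video id taken from the course page
resp = requests.get('https://p.bokecc.com/servlet/playinfo', params={'m': 1, 'vid': vid})
root = ET.fromstring(resp.text)
# Each <copy> node is one quality variant; print all candidate URLs.
print([copy.attrib['playurl'] for copy in root.findall('.//copy')])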

Finding the video list

The page source does contain the video ids, but there are two problems: you have to be logged in, and the page is built with Vue. I couldn't find any API request either, so the only option was to drive the browser with selenium.

Implementation

I split the work into two parts: selenium grabs the video ids and file names and saves them as JSON, then aiohttp downloads the video files, which makes the downloads faster.

stuq_spider.py

import time
import json
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


class StuqSpider:
    def __init__(self):
        self.driver = webdriver.Chrome('/Users/tom/workspace/chromedriver')

    def crawl_course(self, courseid):
        coursewares = self.get_coursewares(courseid)
        data = []
        for url in coursewares:
            info = self.get_video_info(url)
            data.append(info)
        self.store_data(data)

    def login_weibo(self, username, password):
        self.driver.get('https://passport.stuq.org/user/login')
        self.driver.find_element_by_xpath('//a[@class="button weibo"]').click()

        userid = self.driver.find_element_by_id('userId')
        userid.send_keys(username)
        passwd = self.driver.find_element_by_id('passwd')
        passwd.send_keys(password)

        # Slow down a little to look less like a bot.
        time.sleep(5)

        self.driver.find_element_by_xpath(
            '//*[@id="outer"]/div/div[2]/form/div/div[2]/div/p/a[1]'
        ).send_keys(Keys.ENTER)

        # Leave time to finish the login by hand; a captcha may need to be entered.
        time.sleep(20)

    def get_coursewares(self, courseid):
        coursewares = []
        self.driver.implicitly_wait(5)
        self.driver.get('http://www.stuq.org/course/' + courseid + '/study')
        for a in self.driver.find_elements_by_xpath('//p[@class="video"]/a'):
            coursewares.append(a.get_attribute('href'))
        return coursewares

    def get_video_info(self, url):
        self.driver.implicitly_wait(2)
        self.driver.get(url)
        h2 = self.driver.find_element_by_tag_name('h2').text
        videoid = self.driver.find_element_by_xpath(
            '//section[@id="cc-video"]/div').get_attribute('id').split('_')[2]
        return [h2, videoid]

    def store_data(self, data):
        with open('stuq.json', 'w') as fp:
            json.dump(data, fp)

    def close(self):
        self.driver.close()
        self.driver.quit()


username = ''  # Sina Weibo account
password = ''  # Sina Weibo password
courseid = '1015'

spider = StuqSpider()
try:
    spider.login_weibo(username, password)
    spider.crawl_course(courseid)
finally:
    spider.close()

bokecc_crawler.py

import aiohttp
import asyncio
import json
import time
import os.path
import xml.etree.ElementTree as ET

base_url = 'https://p.bokecc.com/servlet/playinfo?m=1&vid='
chunk_size = 64


class Crawler:
    def __init__(self, max_task=4, loop=None):
        self.max_task = max_task
        self.loop = loop or asyncio.get_event_loop()
        self.q = asyncio.Queue(loop=self.loop)
        self.session = aiohttp.ClientSession(loop=self.loop)
        self.t0 = self.t1 = 0

    def add_task(self, task):
        self.q.put_nowait(task)

    async def crawl(self):
        workers = [
            asyncio.Task(self.work(), loop=self.loop)
            for _ in range(self.max_task)
        ]
        self.t0 = time.time()
        await self.q.join()
        self.t1 = time.time()
        for w in workers:
            w.cancel()

    async def work(self):
        try:
            while True:
                filename, vid = await self.q.get()
                print('working:', filename, vid)
                # playinfo returns XML; the real video URL is the playurl of the last <copy> node.
                xmldata = await self.fetch(base_url + vid)
                root = ET.fromstring(xmldata)
                copys = root.findall(".//copy")
                playurl = copys[-1].attrib['playurl']
                await self.download_video(playurl, filename)
                self.q.task_done()
        except asyncio.CancelledError:
            pass

    async def fetch(self, url):
        async with self.session.get(url) as response:
            return await response.text()

    async def download_video(self, url, filename):
        # Assumes a videos/ directory already exists.
        async with self.session.get(url) as response:
            with open(os.path.join('videos', filename), 'wb') as fd:
                while True:
                    chunk = await response.content.read(chunk_size)
                    if not chunk:
                        break
                    fd.write(chunk)

    def close(self):
        self.session.close()

    def report(self):
        print('total time: ', self.t1 - self.t0)


loop = asyncio.get_event_loop()
crawler = Crawler(loop=loop)

with open('stuq.json') as fp:
    data = json.load(fp)

for filename, vid in data:
    filename += '.flv'
    crawler.add_task((filename, vid))

try:
    loop.run_until_complete(crawler.crawl())
finally:
    crawler.report()
    crawler.close()

Results

It ran for about 20 minutes and downloaded 29 videos, 2.1 GB in total.
(screenshot: stuq_spider)

Preface

I like writing crawlers, and crawlers have to extract information. Regular expressions are too much hassle for that, so many libraries integrate an XPath selector. If you're not yet fluent with XPath, or just want to test an expression, it's a bit inconvenient, and there aren't many good plugins or tools online. In fact Chrome's DevTools are already quite good for both exporting and testing XPath.

Chrome DevTools

There are three ways to open Chrome DevTools:

  • right-click on the page > Inspect
  • the top-right menu > More Tools > Developer Tools
  • the keyboard shortcut
    • macOS: option(alt)+command+i

Extracting an XPath

In the Elements panel, select the element you want, then right-click Copy > Copy XPath. This is convenient, but it has real drawbacks:
the path is very deep, it targets a single element, and it is fragile, so a small change to the site breaks it. You usually have to adjust it yourself, for example by anchoring it on a class name or an id.

(screenshot: export-xpath)
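To see why the copied expression is fragile, here is a small sketch using scrapy's Selector; the HTML and both expressions are made-up examples, not taken from a real page.

from scrapy import Selector

html = '<html><body><div id="news"><h2 class="title"><a href="/a">hello</a></h2></div></body></html>'
sel = Selector(text=html)

# What Copy XPath typically gives: an absolute path tied to the current layout.
sel.xpath('/html/body/div/h2/a/text()').extract()
# A relative expression anchored on an id or class survives layout changes better.
sel.xpath('//div[@id="news"]//a/text()').extract()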

Testing an XPath

In the Console you can use $x(path) to test whether an expression matches anything.
$x(path)
returns an array of DOM elements that match the given XPath expression.
Note that it only returns elements here, not element attributes.

(screenshot: test-xpath)

Today I saw [Open Jarvis] 如何讓Python 自動將語音轉譯成文字? on YouTube, which reminded me that a while ago I wanted to transcribe the speech in videos without subtitles into text, to make them easier to follow (the pain of bad English; those who have it know it).
Everything I found required applying for an API key and came with various limits, such as being paid (ha).

Installation

ffmpeg

Official site: http://ffmpeg.org/
It is used here to extract the audio track from the video.

brew install ffmpeg

SpeechRecognition

SpeechRecognition
wraps a number of online and offline APIs behind one uniform interface.
Since I don't like anything that requires applying for a key, the offline CMU Sphinx engine is the only option.

First go to CMUSphinx and install sphinxbase and pocketsphinx; see cmusphinx/pocketsphinx.
Before installing them you also need to install swig, otherwise the build fails.

brew install swig

Once those are installed:

pip install pocketsphinx

Extracting the audio

For example, to extract the audio of das-0091-introduction-to-computation-4k.mp4 into test.wav:

ffmpeg -i das-0091-introduction-to-computation-4k.mp4 -vn test.wav

WAV is used because the AudioFile interface only supports WAV/AIFF/FLAC.

Extracting the text

The official example audio_transcribe.py shows it's very simple.

import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile('test.wav') as source:
    audio = r.record(source)

print(r.recognize_sphinx(audio))

The output is as follows:

this series is going to cover competition and i think we should begin by just laying out the topics are going to mention along the way so we have a sense of where we’re going the first of those topics is the radical simplicity of computation it turns out that all the complexity of our computers and programming languages and operating systems is comp complexity that we have added it is not fundamental to computation and to see that we’re going to look both at the turn machine and abby lana calculus which are the two most well known models of computation were these the two most well known abstract models both of these are normally talk with the very mathematical kind of terminology indication lot of greek letters and so on but we’re just going to use python code because python is aline was any programmer can understand pretty easily and that wall hours to get at least a high level understanding of what’s going on inside of the systems the next topic that we’re going to talk about is the limits of computation specifically be holding problem which is an example of an undesirable problem at the computer science be holding problems as they did really quickly is the problem of writing a function let’s call a halt it takes another function of scott after the decides whether half will terminate we’re not itself will terminate eventually then hold your return troops itself wolford sample would forever than a halt to return faults and it’s easy to state that problem as i just did it but you cannot write this launch and no programming language can express this function no computational system can express his function at the highly non obvious result but there are rigorous mathematical proof so this going back to the nineteen thirty’s and they have held up for a year’s both in theory and practice so will see why that’s true are these the high level sketch of why that’s true and some of the implications of it for the rest of computation we’re also going to see the structure of computation specifically the idea of trying equivalents which tells us that the turing machine and amanda calculus are both capable of answering the same questions any question that one of them can answer the other cancer and it turns out to this is true of our real world computers as well including the laptop and i’m recording this on if my laptop unanswered question and so could turn machine and this is extremely surprising given how simple turn machines are exactly first look at them you’re not going to believe that their actual general purpose computer system but it turns out that in fact they are because of turner problems which once again as a rigorous mathematical proof is going back to the nineteen eighties excuse me hit eighty years ago the nineteen thirty’s a related idea that will talk about is the chon ski hierarchy of computational systems and it turns out that this turn of quo blood type of system is only the most powerful type of computation there are four levels and as hierarchy and they began with the weakest which is a bullet to what we called regular depressions the next level is what you would need if you wanted to recognize python code mi you want to decide whether as spring is valid python or is not about python the next levels which would need if you wanted to recognize as c. 
plus plus code and this distinction is very important in fact python was intentionally designed to require less complexity in the top additional system that recognizes that most programming languages have this level of complex including for example a job as script and one of the reasons the python is so we see the lord and revisit potentially was designed to require less complexity finally the last level in this hierarchy is the turner problem level which contains trainers shane solana calculus my laptop and so on and this hierarchy is first of all amazing just as relates these things that seem unrelated if you haven’t learned as yet but that’s not even the most amazing thing about it the most amazing thing is it known tom speed created this hierarchy when he was studying linguistics he was studying natural languages like english and he establishes different levels of linguistic complexity which computer scientists then took the news for all kinds of things including programming languages but also categorizing finite state machines which fall into these different categories in different ways and will see all of these things are more detail as we go that’s all i want to say about this introduction next time we’re going to pick up trade machines were gonna ride that simulator that could be wrapped hemlines a code and take about ten minutes of soul be quite easy to write so i’ll see you next time for training sheets

Summary

The example is simple; most of the time went into installing the packages. The transcription still contains mistakes, but the gist is not hard to follow. I don't know how well the online APIs would do; I suspect their accuracy would be higher, and anyone interested can try them (a minimal sketch of swapping in an online engine follows).
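For anyone who wants to compare, SpeechRecognition exposes the online engines through the same Recognizer, so it is roughly a one-line change. A sketch: recognize_google sends the audio to Google's free web API and needs no key for light testing, but it is rate-limited, so results may vary.

import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile('test.wav') as source:
    audio = r.record(source)

# Swap the offline Sphinx engine for the online Google one.
print(r.recognize_google(audio))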