Dust8's Blog

Reading the records source code

Posted on 2020-10-25, updated on 2023-03-23

kennethreitz-archive/records:
Records: SQL for Humans™.
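records is a small Python library. Its code base is short, which makes it easy to read and a good subject for source-reading notes. It is mainly a wrapper around SQLAlchemy and Tablib: the former handles database operations and the latter formats results for export. It is a good example of how to combine existing libraries.

Reading the source

Finding the entry point

Start with the example from the official documentation: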

import records

db = records.Database('postgres://...')
rows = db.query('select * from active_users')    # or db.query_file('sqls/active-users.sql')

As the example shows, the Database class is the main entry point.

class Database(object):
    """A Database. Encapsulates a url and an SQLAlchemy engine with a pool of
    connections.
    """

    def __init__(self, db_url=None, **kwargs):
        # If no db_url was provided, fallback to $DATABASE_URL.
        self.db_url = db_url or os.environ.get('DATABASE_URL')

        if not self.db_url:
            raise ValueError('You must provide a db_url.')

        # Create an engine.
        self._engine = create_engine(self.db_url, **kwargs)
        self.open = True

Relationships between the main classes

All of Database's database operations are delegated to the Connection class, and Connection in turn wraps an SQLAlchemy connection.

class Database(object):
    ...

    def get_connection(self):
        """Get a connection to this Database. Connections are retrieved from a
        pool.
        """
        if not self.open:
            raise exc.ResourceClosedError('Database closed.')

        return Connection(self._engine.connect())

    def query(self, query, fetchall=False, **params):
        """Executes the given SQL query against the Database. Parameters can,
        optionally, be provided. Returns a RecordCollection, which can be
        iterated over to get result rows as dictionaries.
        """
        with self.get_connection() as conn:
            return conn.query(query, fetchall, **params)

class Connection(object):
    """A Database connection."""

    def __init__(self, connection):
        self._conn = connection
        self.open = not connection.closed

    def query(self, query, fetchall=False, **params):
        """Executes the given SQL query against the connected Database.
        Parameters can, optionally, be provided. Returns a RecordCollection,
        which can be iterated over to get result rows as dictionaries.
        """

        # Execute the given query.
        cursor = self._conn.execute(text(query), **params) # TODO: PARAMS GO HERE

        # Row-by-row Record generator.
        row_gen = (Record(cursor.keys(), row) for row in cursor)

        # Convert psycopg2 results to RecordCollection.
        results = RecordCollection(row_gen)

        # Fetch all results if desired.
        if fetchall:
            results.all()

        return results
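
The generator-plus-cache pattern in Connection.query can be tried out with the standard library alone. The following is a minimal sketch, not the records implementation: it assumes an in-memory SQLite database, and `LazyRows`, `query` and the sample rows in `active_users` are hypothetical names used only for illustration.

```python
import sqlite3


class LazyRows:
    """Hypothetical stand-in for RecordCollection: rows are cached as they are consumed."""

    def __init__(self, rows):
        self._rows = rows   # generator; no row has been pulled from it yet
        self._cache = []    # rows that have already been pulled

    def __iter__(self):
        i = 0
        while True:
            if i == len(self._cache):
                # Cache exhausted: pull one more row from the generator.
                try:
                    self._cache.append(next(self._rows))
                except StopIteration:
                    return
            yield self._cache[i]
            i += 1

    def all(self):
        """Consume the generator and return every row as a list."""
        return list(self)


def query(conn, sql, fetchall=False, **params):
    """Mirror the shape of Connection.query on top of sqlite3 (an assumption)."""
    cursor = conn.execute(sql, params)
    keys = [column[0] for column in cursor.description]
    row_gen = (dict(zip(keys, row)) for row in cursor)
    results = LazyRows(row_gen)
    if fetchall:
        results.all()
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE active_users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO active_users VALUES (?, ?)", [("ann", 30), ("bob", 25)])

lazy = query(conn, "SELECT * FROM active_users ORDER BY name")
eager = query(conn, "SELECT * FROM active_users WHERE age > :min_age",
              fetchall=True, min_age=26)
```

With fetchall left at False nothing is cached until the result is iterated, while fetchall=True fills the cache before the function returns.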

Learning usage from the source

Some usage is undocumented, so you have to read the source to learn it. Transactions are one example:

class Database(object):
    ...

    @contextmanager
    def transaction(self):
        """A context manager for executing a transaction on this Database."""

        conn = self.get_connection()
        tx = conn.transaction()
        try:
            yield conn
            tx.commit()
        except:
            tx.rollback()
        finally:
            conn.close()

As you can see, transaction() is a context manager that yields the connection object, so it can be used like this:

with db.transaction() as tx:
    user = {"name": "yuze9", "age": 20}
    tx.query('INSERT INTO lemon_user(name,age) values (:name, :age)', **user)
    # The next statement is invalid SQL; because it fails, the INSERT above is rolled back.
    tx.query('sof')
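
The commit-or-rollback behaviour can be reproduced with the standard library. This is a minimal sketch under stated assumptions, not the records code: it uses an in-memory SQLite database, the `transaction` helper is hypothetical, it catches `Exception` where the quoted source uses a bare `except`, and it leaves out closing the connection so the result can be inspected afterwards.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Hypothetical helper with the same try/commit/rollback shape as above."""
    try:
        yield conn
        conn.commit()
    except Exception:
        # Like the quoted source, roll back without re-raising,
        # so the error does not propagate out of the with block.
        conn.rollback()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lemon_user (name TEXT, age INTEGER)")

with transaction(conn) as tx:
    tx.execute("INSERT INTO lemon_user (name, age) VALUES (:name, :age)",
               {"name": "yuze9", "age": 20})
    tx.execute("sof")  # invalid SQL, raises sqlite3.OperationalError

# The INSERT was rolled back together with the failed statement.
count = conn.execute("SELECT COUNT(*) FROM lemon_user").fetchone()[0]
```

Because the except branch does not re-raise, the with block finishes silently; in this shape a failed transaction has to be detected by checking the data afterwards.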

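Django query pitfalls

Posted on 2020-08-24, updated on 2023-03-23

All of these pitfalls are covered in the official documentation, but there is so much second-hand material around that it is easy to be misled.

The F expression pitfall

To avoid race conditions, where several places update the same record and corrupt the data, I used an F expression. Later code kept using the same model instance and saved it again, so the F() expression was applied more than once and the data ended up wrong. The fix is to call refresh_from_db() after saving, which reloads the model object.

from django.db.models import F

# Wrong: the F expression is applied twice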
reporter = Reporters.objects.get(name='Tintin')
reporter.stories_filed = F('stories_filed') + 1
reporter.save()

reporter.name = 'Tintin Jr.'
reporter.save()

# Correct: the F expression is applied only once
reporter = Reporters.objects.get(name='Tintin')
reporter.stories_filed = F('stories_filed') + 1
reporter.save()
reporter.refresh_from_db()

reporter.name = 'Tintin Jr.'
reporter.save()
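
The mechanism can be simulated without Django. The sketch below is an illustration under stated assumptions, not how the ORM is implemented: `INCREMENT` stands in for `F('stories_filed') + 1`, and `save`, `load` and `run` are hypothetical helpers over an in-memory SQLite table. The point is that the expression stays on the instance, so every save sends it to the database again until the instance is reloaded.

```python
import sqlite3

# Stands in for F('stories_filed') + 1: an expression evaluated by the database.
INCREMENT = "stories_filed + 1"


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reporter (id INTEGER PRIMARY KEY, name TEXT, stories_filed INTEGER)"
    )
    conn.execute("INSERT INTO reporter VALUES (1, 'Tintin', 0)")
    return conn


def save(conn, instance):
    """Hypothetical save(): writes back every field, expression included.

    Interpolating into SQL is for demonstration only; never do this with user input.
    """
    conn.execute(
        f"UPDATE reporter SET name = ?, stories_filed = {instance['stories_filed']} "
        "WHERE id = 1",
        (instance["name"],),
    )


def load(conn):
    """Hypothetical refresh_from_db(): replaces the expression with the stored value."""
    name, stories_filed = conn.execute(
        "SELECT name, stories_filed FROM reporter WHERE id = 1"
    ).fetchone()
    return {"name": name, "stories_filed": stories_filed}


def run(refresh):
    conn = make_db()
    reporter = load(conn)
    reporter["stories_filed"] = INCREMENT
    save(conn, reporter)
    if refresh:
        reporter = load(conn)
    reporter["name"] = "Tintin Jr."
    save(conn, reporter)
    return load(conn)["stories_filed"]
```

Calling `run(False)` leaves the counter at 2 after what was meant to be a single increment, while `run(True)` leaves it at 1.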

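The annotate pitfall

The query below gets the maximum weight for each user_id. Note the explicit order_by(): without it, the ordering from the model's Meta.ordering is included in the query, which can split the groups and change the result.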

hitlogs = (
        HitLog.objects.filter(user_id__in=user_ids)
        .values("user_id")
        .annotate(weight=Max("weight"))
        .order_by()
    )

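Deprecated since version 2.2:
Starting in Django 3.1, the ordering from a model's Meta.ordering won't be used in GROUP BY queries,
such as .annotate().values(). Since Django 2.2, these queries issue a deprecation warning
indicating to add an explicit order_by() to the queryset to silence the warning.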
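
Why the extra ordering matters is easiest to see in SQL. The sketch below is a simplified approximation of the two queries, not the exact SQL Django generates; the `created` column and a model with `Meta.ordering = ['created']` are assumptions made for the example, run here against an in-memory SQLite table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hitlog (user_id INTEGER, weight INTEGER, created TEXT)")
conn.executemany(
    "INSERT INTO hitlog VALUES (?, ?, ?)",
    [(1, 10, "2020-08-01"), (1, 30, "2020-08-02"), (2, 20, "2020-08-03")],
)

# With an explicit order_by(): grouped by user_id only, one row per user.
per_user = sorted(conn.execute(
    "SELECT user_id, MAX(weight) FROM hitlog GROUP BY user_id"
).fetchall())

# With the default ordering applied (assumed to be 'created'):
# the ordering column joins the GROUP BY and splits each user's rows.
with_ordering = sorted(conn.execute(
    "SELECT user_id, MAX(weight) FROM hitlog GROUP BY user_id, created ORDER BY created"
).fetchall())
```

The first query returns one row per user; the second returns a row for every distinct (user_id, created) pair, so user 1 appears twice.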

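Vue and tables

Posted on 2020-08-18, updated on 2023-03-23

Recently an admin table needed sorting. Doing it with plain JavaScript felt tedious, so I used Element.

Installation

Since it is only a single page, I loaded everything from a CDN. Later I switched from unpkg to https://cdnjs.cloudflare.com, because unpkg is slow and also rate-limited.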

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <!-- import CSS -->
  <link rel="stylesheet" href="https://unpkg.com/element-ui/lib/theme-chalk/index.css">
</head>
<body>
  <div id="app">
    <el-button @click="visible = true">Button</el-button>
    <el-dialog :visible.sync="visible" title="Hello world">
      <p>Try Element</p>
    </el-dialog>
  </div>
</body>
  <!-- import Vue before Element -->
  <script src="https://unpkg.com/vue/dist/vue.js"></script>
  <!-- import JavaScript -->
  <script src="https://unpkg.com/element-ui/lib/index.js"></script>
  <script>
    new Vue({
      el: '#app',
      data: function() {
        return { visible: false }
      }
    })
  </script>
</html>

Table

The features needed are multiple selection, sorting, a fixed header, and custom column templates.

<el-table :data="predata" style="width: 100%" height="600"
    :default-sort="{prop: 'user_study_score', order: 'descending'}"
    @selection-change="handlePreSelectionChange">
    <el-table-column type="selection" width="45">
    </el-table-column>
    <el-table-column prop="pre_order_id" label="订单号" sortable width="100">
    </el-table-column>
    <el-table-column prop="weight" label="风险等级" sortable width="100">
        <template slot-scope="scope">
            <span :class="scope.row.weight_class">{{ scope.row.weight }}</span>
        </template>
    </el-table-column>
    <el-table-column prop="pre_mobile" label="预约人" width="120">
    </el-table-column>
    <el-table-column prop="pre_pay_time" label="预约时间" sortable>
    </el-table-column>
    <el-table-column prop="user_study_score" label="学分" sortable>
    </el-table-column>
    <el-table-column prop="pre_price" label="订单金额" sortable>
    </el-table-column>
</el-table>

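Re-rendering data

The risk level column (风险等级) is loaded from a separate API, and its value is then set as a property on each row object. Even after the data came back, the table did not re-render, because Vue does not detect properties added to an object after it has become reactive. The property has to be set with this.$set instead.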

element.weight_class = 'color_weight ' + class_name;
// $set adds the new property reactively, which triggers a re-render
this.$set(element, 'weight', res.weight)

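Web scraping and Docker

Posted on 2020-08-17, updated on 2023-03-23

I recently wrote another scraper, this time for merchants on Meituan. Here is a summary of the new approach.
The API is encrypted and I did not want to reverse-engineer the parameters, so I rendered the pages in a browser instead. requests-html did not work well, and I did not want to write it with pyppeteer, so I picked Splash, a no-brainer rendering service.

JavaScript rendering service

Splash is installed with Docker, which is quick and convenient:

docker run -it -p 8050:8050 --rm scrapinghub/splash

Rendering a page also executes the site's fingerprinting code, so the container has to be restarted regularly to keep any single instance from accumulating too many requests:

sudo docker stop $(sudo docker ps -f ancestor=scrapinghub/splash -q) && sudo docker run -it -d -p 8050:8050 --rm scrapinghub/splash

This can go into a scheduled job, for example restarting every half hour:

# crontab -e
*/30 * * * * sudo docker stop $(sudo docker ps -f ancestor=scrapinghub/splash -q) && sudo docker run -it -d -p 8050:8050 --rm scrapinghub/splash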

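Proxy service

Because of the anti-scraping measures, the IP address has to be changed constantly; otherwise CAPTCHAs and other anti-scraping responses appear. Buying proxies is not cost-effective, so I chose a self-hosted open-source proxy pool, https://github.com/kagxin/proxy-pool, which can also be deployed with Docker.

Retry library

Free proxies are generally of poor quality, so requests have to be retried repeatedly to get complete data. The code below takes a crude approach: if nothing can be parsed from the page, it raises an error on purpose to trigger a retry.

import json
import random

import requests
from lxml.html import fromstring
from retrying import retry

# get_proxy, loggers and USER_AGENTS come from the rest of the project and are not shown here.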

splash_url = "http://localhost:8050/render.html"

@retry
def get_shop_urls(url, headers=None):
    proxy = get_proxy()
    loggers.info(f"get_shop_urls proxy {proxy} {url}")
    params = {
        "url": url,
        "http_method": "GET",
        "headers": {
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "user-agent": random.choice(USER_AGENTS),
        },
        "wait": 2,
        "images": 0,
        "proxy": proxy,
    }
    html = requests.post(
        splash_url,
        data=json.dumps(params),
        headers={"Content-Type": "application/json"},
    ).text

    if "对不起" in html:
        return []

    tree = fromstring(html)
    links = []
    for item in tree.xpath('//ul[@class="list-ul"]//div[@class="info"]//a/@href'):
        links.append(item)

    if not links:
        raise ValueError()

    return links
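
A bare @retry from the retrying library retries forever without waiting; the library's stop_max_attempt_number and wait_fixed options can bound that. What such a decorator does can be sketched with the standard library. This is a minimal illustration, not the retrying implementation: the `retry` decorator here, its `max_attempts` and `wait_seconds` parameters, and the `parse_links` function are all hypothetical.

```python
import functools
import time


def retry(max_attempts=3, wait_seconds=0.0):
    """Hypothetical bounded retry decorator built on the standard library."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # give up and surface the last error
                    time.sleep(wait_seconds)
        return wrapper
    return decorator


calls = []


@retry(max_attempts=5)
def parse_links():
    """Hypothetical parser that finds nothing on its first two attempts."""
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("no links parsed")
    return ["/shop/1", "/shop/2"]


links = parse_links()
```

Raising when the parse result is empty, as get_shop_urls does, is what turns a bad proxy response into another attempt; a bound on the attempts keeps a permanently failing URL from looping forever.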

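Django and database views

Posted on 2020-08-17, updated on 2023-03-23

The project uses a database view, and querying it failed with Unknown column 'xxx.id' in 'field list'.
The cause is that a Django model must have a primary key; when no field is marked as one, Django adds an implicit id field, which the view does not have. Setting primary_key=True on whichever field suits the role fixes it.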
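
The error itself is easy to reproduce outside Django. The sketch below is an illustration under stated assumptions: SQLite stands in for MySQL, so the error wording differs, and the `staff` table, the `project_staff` view and the `staff_no` column are hypothetical. It shows that selecting an id column the view does not expose fails, while selecting the column chosen as the key works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, staff_no TEXT, name TEXT)")
conn.execute("INSERT INTO staff (staff_no, name) VALUES ('S001', 'ann')")
# The view does not expose the underlying id column.
conn.execute("CREATE VIEW project_staff AS SELECT staff_no, name FROM staff")

# Roughly what a model with an implicit id field would ask for.
try:
    conn.execute("SELECT id, staff_no, name FROM project_staff")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

# What is asked for once staff_no is treated as the primary key.
rows = conn.execute("SELECT staff_no, name FROM project_staff").fetchall()
```

In the Django model this corresponds to declaring the chosen field with primary_key=True, so no implicit id field is added.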

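Reference:

"Unknown column 'project_staff.id' in 'field list'" error when Django queries a MySQL view (original title: django 调用 MySQL 视图时报错 “Unknown column ‘project_staff.id’ in ‘field list’”)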