V2EX  ›  Python

How do I deduplicate against MongoDB in Scrapy, and how should the download pipeline be managed?

Ewig · 11 days ago · 329 views
    from scrapy.pipelines.files import FilesPipeline
    from scrapy import Request
    from scrapy.conf import settings

    import pymongo


    class XiaoMiQuanPipeLines(object):
        def __init__(self):
            host = settings["MONGODB_HOST"]
            port = settings["MONGODB_PORT"]
            dbname = settings["MONGODB_DBNAME"]
            sheetname = settings["MONGODB_SHEETNAME"]

            client = pymongo.MongoClient(host=host, port=port)
            mydb = client[dbname]
            self.post = mydb[sheetname]

        def process_item(self, item, spider):  # the spider argument was missing
            url = item['file_url']
            name = item['name']

            # Intended as a compound-key duplicate check on (url, name)
            result = self.post.aggregate(
                [
                    {"$group": {"_id": {"url": url, "name": name}}}
                ]
            )
            if result:
                pass
            else:
                self.post.insert({"url": url, "name": name})
            return item


    class DownLoadPipelines(FilesPipeline):

        def file_path(self, request, response=None, info=None):
            return request.meta.get('filename', '')

        def get_media_requests(self, item, info):
            file_url = item['file_url']
            meta = {'filename': item['name']}
            yield Request(url=file_url, meta=meta)


    Here I write two pipelines: the first checks for duplicates; if the item is a duplicate it is not downloaded, otherwise it is written to the database and then downloaded. I use aggregate on the compound key (url, name) for deduplication.
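    One caveat with the code above: `aggregate` returns a cursor object, which is truthy even when it yields no documents, so `if result:` will always take the "duplicate" branch. A plain `find_one` on the compound key is a simpler check. A minimal sketch of that idea (the `FakeCollection` stand-in is hypothetical, only there so the flow runs without a live MongoDB; a real pymongo collection exposes the same `find_one`/`insert_one` methods):

    ```python
    def is_duplicate(collection, url, name):
        # One query on the compound key: True when a document with
        # this exact (url, name) pair already exists.
        return collection.find_one({"url": url, "name": name}) is not None

    def save_if_new(collection, url, name):
        """Insert and return True, or return False for a duplicate."""
        if is_duplicate(collection, url, name):
            return False
        collection.insert_one({"url": url, "name": name})
        return True

    # Hypothetical in-memory stand-in so the flow can be exercised
    # without a MongoDB server.
    class FakeCollection:
        def __init__(self):
            self.docs = []

        def find_one(self, query):
            for doc in self.docs:
                if all(doc.get(k) == v for k, v in query.items()):
                    return doc
            return None

        def insert_one(self, doc):
            self.docs.append(doc)
    ```

    Note this read-then-write check is not atomic; under concurrent inserts a unique index (as suggested in reply 3) is the race-free option.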
    6 replies  |  until 2018-10-12 17:31:35 +08:00
        1
    watsy0007   11 days ago
    ```python
    from pymongo import MongoClient

    # config (providing MONGO_URL) and generator_summary are the
    # poster's own helpers, not shown in the thread.


    class MongoCache:
        db = None

        def __init__(self):
            # The original checked hasattr(MongoCache, 'pool'), an attribute
            # that is never set, so the client was recreated on every
            # instantiation; checking db gives the intended singleton.
            if MongoCache.db is None:
                MongoCache.create_instance()

        @staticmethod
        def create_instance():
            client = MongoClient(config.MONGO_URL)
            MongoCache.db = client['spider']

        def create(self, table, unique_key, origin_data):
            if self.exists(table, unique_key):
                return None

            summaries = {k: generator_summary(v) for (k, v) in origin_data.items()}

            return self.db[table].insert({
                'unique_key': unique_key,
                'data': origin_data,
                'summaries': summaries
            })

        def get(self, table, unique_key):
            data = self.db[table].find_one({'unique_key': unique_key})
            if data is None:
                return None
            return data['data']

        def exists(self, table, unique_key):
            data = self.db[table].find_one({'unique_key': unique_key})
            return data is not None

        def is_changed(self, table, unique_key, origin_data):
            if not self.exists(table, unique_key):
                return True

            last_summaries = self.db[table].find_one({'unique_key': unique_key})['summaries']
            for (k, v) in origin_data.items():
                summary = generator_summary(v)
                last_summary = last_summaries.get(k, None)
                # print('{} -> {} | {} -> {}'.format(k, v, summary, last_summary))
                if last_summary is None or last_summary != summary:
                    return True
            return False

        def change_fields(self, table, unique_key, origin_data):
            if not self.exists(table, unique_key):
                return origin_data
            changes = {}
            last_summaries = self.db[table].find_one({'unique_key': unique_key})['summaries']
            for (k, v) in origin_data.items():
                last_summary = last_summaries.get(k, None)
                if last_summary is None or last_summary != generator_summary(v):
                    changes[k] = v
            return changes

        def update(self, table, unique_key, origin_data):
            if not self.exists(table, unique_key):
                return origin_data
            new_summaries = {k: generator_summary(v) for (k, v) in origin_data.items()}
            self.db[table].update_one({'unique_key': unique_key},
                                      {'$set': {'data': origin_data, 'summaries': new_summaries}})
            return origin_data
    ```
        2
    watsy0007   11 days ago
    V2EX doesn't support Markdown...



    It wraps up the common data-handling methods.
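    The class above calls a `generator_summary` helper that isn't shown; presumably it condenses each field value into a digest so later crawls can cheaply detect changed fields. A minimal sketch under that assumption:

    ```python
    import hashlib

    def generator_summary(value):
        # Digest of the field's string form: equal values give equal
        # summaries, and any change to the value changes the summary.
        return hashlib.md5(str(value).encode("utf-8")).hexdigest()
    ```

    Comparing digests instead of full values keeps the stored `summaries` small even when the crawled fields are large.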
        3
    picone   11 days ago
    I just put a unique index on the Mongo collection and catch the duplicate-key exception.
        4
    Ewig   8 days ago
    @picone Is yours a compound key? I mean url and name together.
        5
    picone   8 days ago
        6
    Ewig   8 days ago
    @picone db.XiaoMiQuan.find()
    { "_id" : ObjectId("5bbf14dbc96b5b3f5627d11d"), "file_url" : "https://baogaocos.seedsufe.com/2018/07/19/doc_1532004923556.pdf", "name" : "AMCHAM-中国的“一带一路”:对美国企业的影响(英文)-2018.6-8 页.pdf" }
    This is what I have now. Is it right?