Python Web Scraping: Techniques and Implementation

Author: neal | Published: 2025-03-21 11:54 | Category: My Articles

I. Basic Scraping Methods

1. Requests + BeautifulSoup

Use case: scraping static pages (no complex JavaScript).

Example code:
python

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url, timeout=10)  # time out instead of hanging forever
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Print the text of every <h1> element on the page
    titles = soup.find_all('h1')
    for title in titles:
        print(title.text)
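
Beyond headings, a common next step is collecting the links on a page. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is a made-up page standing in for `response.text`, and `extract_links` is a hypothetical helper name, not part of BeautifulSoup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page content standing in for response.text.
SAMPLE_HTML = """
<html><body>
  <h1>Example Domain</h1>
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
  <a>No link here</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):  # CSS selector: only anchors with an href
        # urljoin resolves relative paths such as "/about" against base_url
        links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return links


links = extract_links(SAMPLE_HTML, "https://example.com")
for text, href in links:
    print(text, href)
```

Resolving relative URLs with `urljoin` matters when the extracted links are fed back into the crawler as new requests.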

2. Regular Expressions (Regex)

Use case: quickly extracting text that matches a specific pattern.
python

import re
text = '<h1>Hello World</h1>'
# (.*?) is a non-greedy group: it stops at the first closing </h1>
match = re.search(r'<h1>(.*?)</h1>', text)
if match:
    print(match.group(1))  # Output: Hello World
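
Real pages usually contain several matches, tags with attributes, and text that wraps across lines. The sketch below shows one way to handle those cases with `re.findall`; the HTML fragment is invented for illustration. Regex remains fragile on nested or malformed HTML, so a parser is usually the safer choice for anything beyond simple patterns.

```python
import re

# Hypothetical HTML fragment; note the second heading spans two lines.
html = """<h2 class="title">First</h2>
<h2>Second
heading</h2>"""

# [^>]* tolerates attributes on the opening tag; re.DOTALL lets '.' match
# newlines; re.IGNORECASE accepts <H2> as well as <h2>.
pattern = re.compile(r"<h2[^>]*>(.*?)</h2>", re.DOTALL | re.IGNORECASE)

# Collapse internal whitespace so multi-line headings come out on one line.
headings = [" ".join(m.split()) for m in pattern.findall(html)]
print(headings)
```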

II. Scraping Dynamic Pages

1. Selenium

Use case: pages that require simulated browser actions (clicking, scrolling, logging in).

Example code:
python

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get('https://example.com')
    element = driver.find_element(By.TAG_NAME, 'h1')
    print(element.text)
finally:
    driver.quit()  # always close the browser, even if the lookup fails

2. Playwright

A newer browser automation tool that waits for elements automatically and is often faster than Selenium:
python

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com')
    print(page.inner_text('h1'))
    browser.close()

III. Scraping Frameworks

1. Scrapy

Use case: large projects. Scrapy supports asynchronous requests, middleware, and item pipelines.

Create a project:
bash

scrapy startproject myproject
scrapy genspider example example.com

Spider example:
python

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Yield one item per page containing the text of the first <h1>
        yield {'title': response.css('h1::text').get()}
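
Scrapy projects are configured through the `settings.py` file that `scrapy startproject` generates. As a rough sketch, the excerpt below sets a few of Scrapy's built-in settings for polite crawling. The values and the contact URL in `USER_AGENT` are illustrative assumptions, not recommendations from the original article.

```python
# myproject/settings.py (excerpt; values are illustrative)

ROBOTSTXT_OBEY = True                # respect the target site's robots.txt
DOWNLOAD_DELAY = 2                   # seconds to wait between requests to a site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server response times
USER_AGENT = "myproject (+https://example.com/contact)"  # placeholder contact URL
```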

2. PySpider

A distributed scraping framework with a web UI, suited to medium-sized projects.

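IV. Anti-Scraping Measures and Countermeasures

Common anti-scraping measures:
- User-Agent checks
- IP bans
- CAPTCHAs
- Request rate limits

Countermeasures:

Set request headers: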
python

headers = {
 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
requests.get(url, headers=headers)

Use a proxy IP:
python

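# requests picks the proxy by URL scheme, so https:// URLs need an 'https' entry
proxies = {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:3128'}
requests.get(url, proxies=proxies)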
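
When a single proxy gets banned, one option is to rotate through a pool. The sketch below is only an illustration: the addresses in `PROXY_POOL` are placeholders, and `next_proxies` is a hypothetical helper that returns a dict in the shape `requests` expects for its `proxies` argument.

```python
from itertools import cycle

# Placeholder proxy addresses; replace with proxies you are allowed to use.
PROXY_POOL = ["http://10.10.1.10:3128", "http://10.10.1.11:3128"]
_proxy_iter = cycle(PROXY_POOL)  # endless round-robin iterator over the pool


def next_proxies():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(_proxy_iter)
    return {"http": proxy, "https": proxy}


first = next_proxies()
second = next_proxies()
third = next_proxies()  # the pool has two entries, so this wraps around
print(first, second, third)
```

Each call could then be passed along as `requests.get(url, proxies=next_proxies())`.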

Lower the request rate:
python

import time
time.sleep(2)  # wait 2 seconds between requests
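
A fixed pause works for steady crawling, but when a server starts rejecting requests it is common to back off progressively. The sketch below shows exponential backoff with a little random jitter. `backoff_delays` and `fetch_with_retry` are hypothetical helper names, and `fetch` stands for whatever function performs the request (for example a wrapper around `requests.get`).

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, max_delay=30.0, retries=5):
    """Yield delays base, base*factor, ... each capped at max_delay."""
    delay = base
    for _ in range(retries):
        yield min(delay, max_delay)
        delay *= factor


def fetch_with_retry(fetch, url, sleep=time.sleep, **backoff):
    """Call fetch(url); after each failure, wait (delay + jitter) and retry."""
    last_error = None
    for delay in backoff_delays(**backoff):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to your HTTP client's errors
            last_error = exc
            sleep(delay + random.uniform(0, 0.5))  # jitter spreads out retries
    raise RuntimeError(f"giving up on {url}") from last_error


print(list(backoff_delays(base=1, factor=2, max_delay=5, retries=4)))
```

The `sleep` parameter is injectable so the retry logic can be exercised without actually waiting.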

Handling CAPTCHAs: use an OCR library (such as Tesseract) or a third-party API.

V. Data Storage

Save to a file:
python

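import csv
# newline='' avoids blank rows on Windows; utf-8 keeps non-ASCII titles intact
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title'])          # header row
    writer.writerow(['Example Title'])  # one data row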
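
Scraped items are often dictionaries, in which case `csv.DictWriter` saves the manual column bookkeeping. The following is a small sketch; the records and the field names `title` and `url` are made up for illustration.

```python
import csv

# Hypothetical scraped records; the field names are illustrative.
rows = [
    {"title": "Example Title", "url": "https://example.com/page1"},
    {"title": "Title, with a comma", "url": "https://example.com/page2"},
]


def write_rows(f, rows):
    """Write a header line followed by one CSV line per record."""
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)  # fields containing commas are quoted automatically


with open("data.csv", "w", newline="", encoding="utf-8") as f:
    write_rows(f, rows)
```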

Save to a database (such as MongoDB):
python

from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client['mydatabase']
collection = db['mycollection']
collection.insert_one({'title': 'Example'})

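VI. Notes

Comply with laws and regulations:
- Check the target site's robots.txt (e.g. https://example.com/robots.txt).
- Avoid collecting private or copyrighted data.
Ethical scraping:
- Throttle your request rate so you do not put load on the server.
- Credit the data source.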
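
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt so that it runs offline; the crawler name `my-crawler` is also an assumption. A real crawler would instead call `parser.set_url(...)` followed by `parser.read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-crawler"  # assumed crawler name

allowed_home = parser.can_fetch(USER_AGENT, "https://example.com/index.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)  # None when no Crawl-delay is declared

print(allowed_home, allowed_private, delay)
```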

VII. Advanced Techniques

Asynchronous scraping (aiohttp + asyncio):
python

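import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    # Share one session so connections are reused across requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

urls = ['https://example.com/page1', 'https://example.com/page2']
# asyncio.run() expects a coroutine, so gather() is awaited inside main()
pages = asyncio.run(main(urls))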
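
Launching every request at once can overwhelm the target server, which works against the rate-limiting advice earlier. One way to cap concurrency is an `asyncio.Semaphore`. In the sketch below, `fake_fetch` is a stand-in for a real aiohttp request so that the example runs offline, and `MAX_CONCURRENCY` is an assumed limit.

```python
import asyncio

MAX_CONCURRENCY = 2  # assumed limit; tune it for the target site


async def fake_fetch(url):
    """Stand-in for an aiohttp request so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, fetch=fake_fetch, limit=MAX_CONCURRENCY):
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:  # at most `limit` fetches run at once
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page{i}" for i in range(1, 6)]
pages = asyncio.run(crawl(urls))
print(len(pages))
```

Swapping `fake_fetch` for a coroutine that uses a shared `aiohttp.ClientSession` keeps the same structure.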

Using an API:
- When a site offers an API (e.g. https://api.example.com/data), call it directly instead of parsing HTML.
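
API responses are typically JSON, so no HTML parsing is needed. The sketch below is purely illustrative: the query parameters `page` and `per_page`, the response fields `items` and `next_page`, and both helper names are assumptions, since the shape of a real API depends on the site.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.example.com/data"  # example URL from the text


def build_url(page, per_page=50):
    """Build a paginated request URL (parameter names are assumed)."""
    return f"{API_BASE}?{urlencode({'page': page, 'per_page': per_page})}"


# Hypothetical response body; a real one would come from the HTTP response.
SAMPLE_BODY = '{"items": [{"id": 1, "title": "Example"}], "next_page": null}'


def parse_items(body):
    """Return (titles, next_page) extracted from a JSON response body."""
    payload = json.loads(body)
    titles = [item["title"] for item in payload["items"]]
    return titles, payload["next_page"]


print(build_url(2))
print(parse_items(SAMPLE_BODY))
```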

Tags: General