Learning the Python Framework Scrapy

Preface

It has been a very long time since I last wrote any notes. This time I plan to study the Scrapy framework systematically, so I am keeping proper notes to save myself from digging through the documentation all over again later.

Before starting this I had already downloaded every IP + port from the xici proxy site, but to learn Scrapy I am redoing that project as an example. It helps with learning, and it also makes it easy to compare the strengths and weaknesses of a plain requests crawler against the Scrapy framework.

Preparation

Development environment

  • Python 3.5
  • VSCode + Code Runner extension

I currently use VSCode for everything, including taking these notes. Apart from not handling interactive Python input well, everything about VSCode is fine.

Installing Scrapy

  • python -m pip install scrapy

Just run the command above in a terminal. Mind the order of the Python entries in your PATH environment variable; having Anaconda makes things even easier.

Starting a Project

Creating a project

  • scrapy startproject [project_name]

The command above creates a basic crawler project.
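
For orientation, startproject generates roughly the following layout (file names as in recent Scrapy versions; details vary slightly between releases):

```
project_name/
    scrapy.cfg            # deploy configuration
    project_name/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
```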

Creating a spider

  • scrapy genspider [spider_name] [domain]
  • scrapy genspider -t crawl [spider_name] [domain]

Run these commands from the project's top-level directory, i.e. the directory at the same level as the .cfg file; the -t option selects the crawl template.

SQLAlchemy database operations

SQLAlchemy is the best-known ORM (Object-Relational Mapping) framework in Python.

From a development point of view, an ORM lets you manipulate the database directly in an object-oriented way, which is very convenient: no need to write piles of raw SQL statements.

Here I record the basic create/read/update/delete commands.

Installation:

```shell
python -m pip install sqlalchemy
python -m pip install pymysql
```

Reference code:

```python
from sqlalchemy import Column, String, Integer, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

engine = create_engine('mysql+pymysql://root:root@localhost:3306/info')

DBSession = sessionmaker(bind=engine)


class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(String(50))    # MySQL needs an explicit length for VARCHAR columns
    passwd = Column(String(50))

# Create
try:
    db = DBSession()
    db.add(User(name="Hacker1", passwd="1234561"))
    db.commit()
    db.close()
except Exception:
    pass

# Read
try:
    db = DBSession()
    # all rows
    results = db.query(User).all()
    for r in results:
        print(r.name)
    # filter by equality
    result = db.query(User).filter_by(id=4).first()
    print(result.name)
    # filter by comparison
    results = db.query(User).filter(User.id > 5).all()
    for r in results:
        print(r.name)
    db.close()
except Exception:
    pass

# Update
try:
    db = DBSession()
    result = db.query(User).filter_by(id=7).first()
    result.name = "what"
    db.commit()
    db.close()
except Exception:
    pass

# Delete
try:
    db = DBSession()
    result = db.query(User).filter_by(id=9).first()
    db.delete(result)    # delete() takes the object to remove
    db.commit()
    db.close()
except Exception as e:
    print(e)
```

Connecting to and controlling SSH with Python

Using pexpect to drive an SSH connection

(pexpect only works on Unix-like systems, not on Windows)

Connect to SSH from Python, run a command, and show its output:

```python
#!/usr/bin/python
import pexpect

PROMPT = ['# ', '>>> ', '> ', '\$ ']

def send_command(child, cmd):
    child.sendline(cmd)
    child.expect(PROMPT)
    print(child.before)

def connect(user, host, password):
    ssh_newkey = 'Are you sure you want to continue connecting (yes/no)?'
    connStr = 'ssh ' + user + '@' + host
    child = pexpect.spawn(connStr, encoding='utf-8')
    ret = child.expect([pexpect.TIMEOUT, ssh_newkey, '[Pp]assword: '])
    if ret == 0:
        print('[-] Error Connecting')
        return
    if ret == 1:
        print("sending yes to server\n")
        child.sendline('yes')
        ret = child.expect([pexpect.TIMEOUT, '[Pp]assword: '])
        if ret == 0:
            print('[-] Error Connecting')
            return
    child.sendline(password)
    child.expect(PROMPT)
    return child

ss = connect("root", "heiyiren.top", "heiyiren312429020!@#")
send_command(ss, 'cat /etc/shadow | grep root')
```

Port scanning with Python

A TCP full-connect scan as a Python script:

```python
from socket import *
from threading import Thread
import optparse

def connPort(ipAddr, port):
    setdefaulttimeout(1)
    try:
        conn = socket(AF_INET, SOCK_STREAM)
        conn.connect((ipAddr, port))
        conn.send(b"python\r\n")
        print("[+] %d/tcp open" % port)
        results = conn.recv(100)
        print("[+] %s" % str(results, encoding="utf-8"))
    except Exception as e:
        # print(e)
        pass

if __name__ == "__main__":
    parser = optparse.OptionParser("python portscan.py -H <host ip addr> -p <port>")
    parser.add_option("-H", dest='hAddr', type='string', help='specify ip addr')
    parser.add_option("-p", dest='port', type='string', help='specify port')
    (options, args) = parser.parse_args()
    if options.hAddr == None or options.port == None:
        print(parser.usage)
    else:
        ports = str(options.port).split(",")
        for i in ports:
            t = Thread(target=connPort, args=(options.hAddr, int(i)))
            t.start()
```

Cracking zip passwords with Python

Today I wrote a small script to brute-force a zip password:

```python
import zipfile
from threading import Thread
import optparse

def testZipPasswd(zfile, passWord):
    try:
        zfile.extractall(pwd=passWord)
        print("[+] Found Password: " + str(passWord, encoding="utf-8"))
    except Exception as e:
        # print(e)
        pass

def decodeZipPasswd(zfile, passfile):
    with open(passfile) as f:
        zfile = zipfile.ZipFile(zfile)
        for i in f.readlines():
            i = i.replace("\n", "")
            t = Thread(target=testZipPasswd, args=(zfile, i.encode(encoding="utf-8")))
            t.start()

if __name__ == "__main__":
    parser = optparse.OptionParser("python zip.py -f <zipfile> -p <passfile>")
    parser.add_option("-f", dest='zname', type='string', help='specify zip file')
    parser.add_option("-p", dest='passwd', type='string', help='specify password file')
    (options, args) = parser.parse_args()
    if options.zname == None or options.passwd == None:
        print(parser.usage)
        exit(0)
    else:
        decodeZipPasswd(options.zname, options.passwd)
```

Getting network interface information with Python

The usual way to get the local IP address in Python is:

```python
import socket

IP = socket.gethostbyname(socket.gethostname())
```
gethostname returns the host name, and gethostbyname then resolves that name to an IP address.

Now, what if the machine has several NICs/IPs and you want one specific address?

One approach is to get the full list of host IP addresses via socket.gethostbyname_ex and then pick the one you need from the list.

```python
import socket

# With multiple NICs, pick the IP by prefix (works on Windows)
def GetLocalIPByPrefix(prefix):
    localIP = ''
    for ip in socket.gethostbyname_ex(socket.gethostname())[2]:
        if ip.startswith(prefix):
            localIP = ip
    return localIP


print(GetLocalIPByPrefix('192.168'))
```

An even simpler method (no code changes, still using socket.gethostname) is to change the IP priority by editing the hosts file.

The methods above only support IPv4; to get IPv6 information, see socket.getaddrinfo.
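
As a quick sketch of socket.getaddrinfo, here is how to list both IPv4 and IPv6 addresses for a host (the host name is just an example):

```python
import socket

# Each entry is (family, type, proto, canonname, sockaddr);
# sockaddr[0] is the address string for both AF_INET and AF_INET6.
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", None):
    if family == socket.AF_INET:
        print("IPv4:", sockaddr[0])
    elif family == socket.AF_INET6:
        print("IPv6:", sockaddr[0])
```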

1. Getting the local MAC address with the standard library.

For a single NIC:

```python
import uuid

def GetMAC():
    # getnode() returns the MAC as a 48-bit integer; format it as 12 hex
    # digits, zero-padded so a leading zero cannot shift the pairing
    addr = '%012X' % uuid.getnode()
    return '-'.join(addr[i:i + 2] for i in range(0, len(addr), 2))
```

2. Printing network adapter information with the third-party psutil library.

```python
import psutil

# Print MAC and IP info for every adapter
def PrintNetIfAddr():
    dic = psutil.net_if_addrs()
    for adapter in dic:
        snicList = dic[adapter]
        mac = 'no MAC address'
        ipv4 = 'no IPv4 address'
        ipv6 = 'no IPv6 address'
        for snic in snicList:
            if snic.family.name in {'AF_LINK', 'AF_PACKET'}:
                mac = snic.address
            elif snic.family.name == 'AF_INET':
                ipv4 = snic.address
            elif snic.family.name == 'AF_INET6':
                ipv6 = snic.address
        print('%s, %s, %s, %s' % (adapter, mac, ipv4, ipv6))
```

A cross-platform way to get an IP by prefix:

```python
import psutil

# With multiple NICs, pick the IP by prefix
# Tested on Windows and Linux with Python 3.6.x, psutil 5.4.x
# Works for both IPv4 and IPv6 addresses
# Note: if several IPs share the prefix, only one of them is returned
def GetLocalIPByPrefix(prefix):
    localIP = ''
    dic = psutil.net_if_addrs()
    for adapter in dic:
        snicList = dic[adapter]
        for snic in snicList:
            if not snic.family.name.startswith('AF_INET'):
                continue
            ip = snic.address
            if ip.startswith(prefix):
                localIP = ip
    return localIP


print(GetLocalIPByPrefix('192.168'))
```

Solving a base16 challenge

```python
import base64

s = "633765666566653461316130643465386535613065366563653165376130653966336261613065366563653165376662623865346238623665346235623162346236623665326233623565316230653662326238623562336533653665336535623262336238623365346230623562336664"
# b16decode expects uppercase hex digits unless casefold=True is given
s1 = base64.b16decode(s, casefold=True).decode()
print(s1)
j = 0
s2 = ""
d = []
# split the decoded string into pairs of hex digits
for i in s1:
    j = j + 1
    s2 = s2 + i
    if j % 2 == 0:
        d.append(s2)
        s2 = ""

# brute-force the modulus until the flag shows up
for q in range(1, 300):
    s = ""
    for i in d:
        s = s + chr(int(i, 16) % q)
    if "flag" in s:
        print(s)
        print(d)
```

XOR operations in Python

```python
# XOR every character of the file with the key character 'd'
with open("dddddddddddddddd.txt") as f:
    s = f.read()
for i in s:
    print(chr(ord(i) ^ ord('d')), end="")
```
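
This works because XOR is its own inverse: applying the same single-byte key twice restores the original text. A quick sketch (the strings here are made up):

```python
key = 'd'
plain = "flag{xor_demo}"  # hypothetical plaintext
cipher = ''.join(chr(ord(c) ^ ord(key)) for c in plain)
restored = ''.join(chr(ord(c) ^ ord(key)) for c in cipher)
print(restored)  # the original string comes back
```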

Brute-forcing PDF passwords with Python

```python
import PyPDF2

def decrypt_pdf(pdf_file_decrypt):
    filename = 'password.txt'
    file = open(filename, 'r')
    file_type = file.readlines()
    pdf_reader_1 = PyPDF2.PdfFileReader(open(pdf_file_decrypt, 'rb'))
    for i in file_type:
        rs = i.replace('\n', '')
        # decrypt() returns 1 when the user password matches
        if pdf_reader_1.decrypt(str(rs)) == 1:
            print("yes:" + str(rs))
            break
        else:
            pass

pdf_file_decrypt = "as.pdf"
decrypt_pdf(pdf_file_decrypt)
```

A first attempt at a Python crawler

PS: the novels I read online are covered in weird ads; anyone who doesn't know better might think I'm reading something dodgy...

So let's download the novel and store it on my own server.

Step 1: two ways to download a page

Method 1

```python
# _*_ coding:utf-8 _*_
import urllib.request
import urllib.parse
import lxml.html
import lxml.cssselect
import os

# A small download helper of my own:
def download(url, time=2, data=None, user_agent="test", proxy=None):
    print("downloading.....", url)
    headers = {'User-agent': user_agent, "accept-language": "zh-CN,zh;q=0.9,en;q=0.8"}
    request = urllib.request.Request(url, data, headers=headers)
    opener = urllib.request.build_opener()
    proxy_parm = {urllib.parse.urlparse(url).scheme: proxy}
    try:
        if proxy:
            opener.add_handler(urllib.request.ProxyHandler(proxy_parm))
        res = opener.open(request).read()
    except urllib.request.URLError as e:
        print("download error ", e.reason)
        res = None
        if time > 0:
            # retry, and hand back the retry's result
            return download(url, time - 1)
    if res is None:
        return None
    return str(res, encoding="utf-8")
```