Batch-Fetching Pixiv Images with Kotlin

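  Lately I have become hooked on Kotlin, so I took a Pixiv crawler I had written in Java and ported it to Kotlin, optimizing it with Kotlin coroutines along the way (they really are ridiculously good). Below is a rough walkthrough of the port.

Pixiv's API rules

  First, a few words about Pixiv's API. For the details, see the Public Api and App Api references for the exact interface rules (unless stated otherwise, every endpoint returns JSON). I only list the few I use here: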

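Happy coding

TIP: All the code in this article can be found in FastPapi

Getting the token

  Before writing code we need to be clear about what we are doing. We want to crawl images, and there are many kinds of crawl: a user's bookmarks, a user's uploads, tag search, daily recommendations and so on. All of them return a collection of images, and the unit of that collection is a single image, so the lowest layer has to start with fetching the details of one image. If we simply call the image-details endpoint, however, nothing comes back, because Pixiv is of course not naive enough to expose the endpoint without any authentication. The first and most important step is therefore getting a token. The code is as follows: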

import java.io.File
import java.net.URLEncoder

// Holds the credentials and tokens of the current session.
class Authorization {
    companion object {
        @JvmField var accessToken: String? = null
        @JvmField var refreshToken: String? = null
        @JvmField var name: String? = null
        @JvmField var password: String? = null
        @JvmField var cookie: String? = null
        @JvmField var token: String? = null
    }
}

private const val configPath = "config.json"
private const val login = "https://accounts.pixiv.net/login"
private const val tokenUrl = "https://oauth.secure.pixiv.net/auth/token"
private var token: String? = null
private var config: Config? = null
private var cookies: String? = null

// Form values are URL-encoded so that a password containing '&' or '=' cannot break the body.
private fun enc(value: String): String = URLEncoder.encode(value, "UTF-8")

fun getToken(email: String, password: String) {
    File(configPath).apply {
        if (!exists()) createNewFile()
    }
    // Scrape the hidden post_key field from the login page.
    val postKey = request(login).entity pattern "name=\"post_key\" value=\"(\\w+)\""
    val postCookie = "pixiv_id=${enc(email)}&password=${enc(password)}&post_key=$postKey"
    val postData = "username=${enc(email)}&password=${enc(password)}&grant_type=password&client_id=bYGKuGVw91e0NMfPGp44euvGt59s&client_secret=HP3RmkgAmEGro0gn1x9ioawQE8WMfvLXDz3ZqxpK"
    val header = mapOf(
        "Referer" to "http://www.pixiv.net",
        "User-Agent" to "PixivAndroidApp/5.0.64 (Android 6.0)",
        "Content-Type" to "application/x-www-form-urlencoded"
    )
    // Request 1: the OAuth endpoint returns access_token and refresh_token.
    val response = deserializeRequestEntity<TokenResponse>(tokenUrl, postData, header).entity
    // Request 2: the web login returns the cookies, joined here into one Cookie header value.
    cookies = request(login, postCookie).cookieList.joinToString("") { "${it.name}=${it.value};" }
    Authorization.apply {
        accessToken = response.response?.access_token
        refreshToken = response.response?.refresh_token
        name = email
        this.password = password
        this.cookie = cookies
        token = accessToken
    }
    token = response.response?.access_token
    // Persist everything so the next run can log in from config.json.
    config = Config(cookies!!, email, password, System.currentTimeMillis() / 1000, response.response?.access_token, response.response?.refresh_token)
    config!!.serialize(configPath)
}

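  Two things to note. First, two different requests are sent here: a POST to https://oauth.secure.pixiv.net/auth/token and a POST to https://accounts.pixiv.net/login. The former obtains the token and the latter obtains the cookie. The reason is that the token cannot be used to authenticate showcase requests; the showcase endpoint only accepts the cookie. Second, the POST data: in the token request body, client_id and client_secret are fixed, so we only have to fill in the username and password. (One thing worth mentioning: besides access_token, the token response also contains a refresh_token, which can be used to request a new access_token once the old one has expired. When requesting with the refresh_token, the grant_type parameter has to be changed to refresh_token.) The headers need no explanation. The request function is one I wrote in my own util file (Kotlin functions and properties do not have to live inside a class; they can be declared at the top level of a file, which makes them globally accessible). The code is as follows: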

  • request
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.methods.HttpPost
import org.apache.http.entity.StringEntity
import org.apache.http.util.EntityUtils

// HttpResponse<T> is this project's own wrapper (body + cookies), not the Apache interface.
// Sends a GET when postData is null, otherwise a POST with postData as the body.
fun request(url: String, postData: String? = null, header: Map<String, String>? = null): HttpResponse<String> {
    fun post(): HttpResponse<String> {
        val postMethod = HttpPost(url)
        header?.forEach { postMethod.setHeader(it.key, it.value) }
        val entity = StringEntity(postData, "UTF-8")
        entity.setContentEncoding("UTF-8")
        postMethod.entity = entity

        val res = httpClient.execute(postMethod)
        return HttpResponse(EntityUtils.toString(res.entity, "UTF-8"), cookieStore.cookies)
    }
    fun get(): HttpResponse<String> {
        val getMethod = HttpGet(url)
        header?.forEach { getMethod.setHeader(it.key, it.value) }
        val res = httpClient.execute(getMethod)
        return HttpResponse(EntityUtils.toString(res.entity, "UTF-8"), cookieStore.cookies)
    }
    return if (postData == null) get() else post()
}
  • deserializeRequestEntity
inline fun <reified T> deserializeRequestEntity(url: String, postData: String? = null, header: Map<String, String>? = null): HttpResponse<T> {
    // request() already falls back to GET when postData is null.
    val response = request(url, postData, header)
    return HttpResponse(response.entity.deserializeAs(), cookieStore.cookies)
}
  • requestOnToken
// GET with the bearer token and the login cookie attached; caller headers override the defaults.
fun requestOnToken(url: String, header: Map<String, String>? = null): HttpResponse<String> {
    val map = hashMapOf("Authorization" to "Bearer ${token.toString()}", "Cookie" to cookies.toString())
    header?.forEach { map[it.key] = it.value }
    return request(url, null, map)
}

(By the way, reified is a really powerful feature, but since it is not the topic of this article I won't go into it here; look it up if you are interested.)
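For readers who would rather not look it up, here is a minimal sketch of what reified makes possible. The helper names firstOfType and typeNameOf are made up for this illustration and are not part of the crawler; the point is only that a reified type parameter is still known at runtime, which is what lets deserializeRequestEntity<TokenResponse>(...) pick the target type without a Class argument.

```kotlin
// With a plain generic function T is erased at runtime, so `item is T` would not compile.
// `inline` + `reified` substitutes the concrete type at every call site.
inline fun <reified T> Iterable<*>.firstOfType(): T? {
    for (item in this) if (item is T) return item
    return null
}

// The class of a reified T can be read directly, no Class<T> parameter needed.
inline fun <reified T> typeNameOf(): String = T::class.java.simpleName

fun main() {
    val mixed = listOf(1, "two", 3.0)
    println(mixed.firstOfType<String>()) // two
    println(typeNameOf<String>())        // String
}
```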

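Authorization is a holder class that stores the returned information, and Config is a utility class that saves the token locally (I think this utility class turned out rather well; have a look if you are interested). Its main job is to save the configuration file to disk so it can be used for the next login.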
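To make the two grant types mentioned earlier concrete, here is a rough sketch of how the two request bodies could be built. The fields of the password grant are the ones getToken() sends. For the refresh grant, only the grant_type value is stated in this article; the field name refresh_token that carries the token, and the helper names formBody, passwordGrant and refreshGrant, are assumptions made for illustration.

```kotlin
import java.net.URLEncoder

// Encodes key/value pairs as an application/x-www-form-urlencoded body.
fun formBody(params: Map<String, String>): String =
    params.entries.joinToString("&") { (key, value) ->
        URLEncoder.encode(key, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8")
    }

// Body for the initial login (grant_type=password).
fun passwordGrant(email: String, password: String, clientId: String, clientSecret: String): String =
    formBody(mapOf(
        "username" to email,
        "password" to password,
        "grant_type" to "password",
        "client_id" to clientId,
        "client_secret" to clientSecret
    ))

// Body for renewing an expired access_token (grant_type=refresh_token).
// ASSUMPTION: the token travels in a field that is also called "refresh_token".
fun refreshGrant(refreshToken: String, clientId: String, clientSecret: String): String =
    formBody(mapOf(
        "grant_type" to "refresh_token",
        "refresh_token" to refreshToken,
        "client_id" to clientId,
        "client_secret" to clientSecret
    ))

fun main() {
    println(passwordGrant("a@b.c", "p&w", "id", "secret"))
    println(refreshGrant("tok", "id", "secret"))
}
```

Encoding each value also avoids the problem of a password that contains & or = corrupting the body.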

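Getting a single image's details

  With the token in hand we can write the function that handles a single image. The flow is simple: send the request -> parse the JSON -> return the deserialized object: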

@JvmStatic
fun getDetail(id: String): Picture {
    val url = pictureInformation(id)
    val header = mapOf(
        "Referer" to "http://spapi.pixiv.net/",
        "User-Agent" to "PixivIOSApp/5.8.7",
        "Content-Type" to "application/x-www-form-urlencoded"
    )
    val json = requestOnToken(url, header).entity
    return json deserializeAsByTypeToken object : TypeToken<Picture>() {}
}
Nothing more to say here.

Getting search results

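import java.util.Collections
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.GlobalScope
import kotlinx.coroutines.async
import kotlinx.coroutines.launch

@JvmStatic
suspend fun search(tags: String, pageLimit: Int): List<Picture> {
    val list = Collections.synchronizedList(mutableListOf<Picture>())
    val taskList = mutableListOf<suspend () -> Picture>()
    for (page in 1..pageLimit) {
        val link = combineSearch(page, tags.encodeToUTF8())
        val data = requestOnToken(link).entity
        // Pull every illustration id out of the raw JSON and queue a detail request for it.
        (data patternEach "\"id\":(\\d+),\"title\":\"")
            .forEach {
                println("add task for id $it")
                taskList += { getDetail(it) }
            }
    }
    // The counter is updated from several coroutines at once, so it has to be atomic.
    val done = AtomicInteger(0)
    val job = GlobalScope.launch {
        for (task in taskList) async {
            try {
                list += task()
                println("getting picture ${done.incrementAndGet()}/${taskList.size}")
            } catch (ignored: Exception) {
                // skip pictures that fail to load
            }
        }
    }
    job.join() // suspends until the launch block and all of its async children are done
    return list.apply { sort() }
}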

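  The overall flow is not hard: build the URL -> send the request -> parse each image's information out of the returned JSON and store it in a list -> return the list. What mainly needs explaining here is Kotlin's official coroutine library, kotlinx.coroutines, and how Kotlin's generic type arguments differ from Java's.

  First, the code uses mutableListOf<T>(), one of Kotlin's collection builders, which returns a new MutableList<T> instance. Here is its source: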

/**
 * Returns an empty new [MutableList].
 * @sample samples.collections.Collections.Lists.emptyMutableList
 */
@SinceKotlin("1.1")
@kotlin.internal.InlineOnly
public inline fun <T> mutableListOf(): MutableList<T> = ArrayList()

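  In effect this returns Kotlin's ArrayList (which on the JVM is simply Java's ArrayList; Kotlin only defines the interfaces and the implementation is plain Java). The type argument suspend () -> Picture, however, is something unheard of in Java. In Kotlin functions are values too, and you can assign them to variables just as in JavaScript. For example, in val foo = { println("bar") }, foo is a function value of type () -> Unit. In val foo: (Int) -> String = { "$it" }, foo takes an Int and returns its string form, so it is a function value of type (Int) -> String. The general form of a function type is (param1, param2, param3, ...) -> return_type. So mutableListOf<suspend () -> Picture>() in the code above simply declares an ArrayList that stores function values of type suspend () -> Picture. The second key point is that the code uses Kotlin coroutines.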
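As a small, self-contained illustration of function types, the sketch below stores lambdas in a list and only runs them later, which is the same to-do-list idea as taskList. The names greet, render and buildTasks are invented for this example.

```kotlin
// A function value with no parameters that returns Unit.
val greet: () -> Unit = { println("bar") }

// Takes an Int and returns its string form.
val render: (Int) -> String = { "$it" }

// Builds a list of deferred computations; nothing is computed until a task is invoked.
fun buildTasks(inputs: List<Int>): List<() -> Int> {
    val tasks = mutableListOf<() -> Int>()
    for (n in inputs) tasks += { n * n } // the braces make this a lambda, not a call
    return tasks
}

fun main() {
    greet()                                           // bar
    println(render(42))                               // 42
    println(buildTasks(listOf(1, 2, 3)).map { it() }) // [1, 4, 9]
}
```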

  Wikipedia explains coroutines as follows:

Coroutines are computer-program components that generalize subroutines for non-preemptive multitasking, by allowing multiple entry points for suspending and resuming execution at certain locations.

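   Wikipedia also explains the difference between coroutines and threads:

Coroutines are very similar to threads. However, coroutines are cooperatively multitasked, whereas threads are typically preemptively multitasked. This means that coroutines provide concurrency but not parallelism. The advantages of coroutines over threads are that they may be used in a hard-realtime context (switching between coroutines need not involve any system calls or any blocking calls whatsoever), there is no need for synchronisation primitives such as mutexes, semaphores, etc. in order to guard critical sections, and there is no need for support from the operating system. It is possible to implement coroutines using preemptively-scheduled threads, in a way that will be transparent to the calling code, but some of the advantages (particularly the suitability for hard-realtime operation and relative cheapness of switching between them) will be lost.

  Roughly speaking, coroutines are very similar to threads, but coroutines are cooperatively multitasked while threads are preemptively multitasked. In Kotlin, coroutines are implemented by the compiler and a library, whereas threads are scheduled by the operating system, so coroutines need no operating system support and switching between them does not block a thread. (Kotlin coroutines may still run on several threads at once, which is why the code above collects its results in a synchronized list.) When a coroutine gives up control it does not exit the way an ordinary routine does; another coroutine runs instead, and control keeps passing back and forth between them. A coroutine saves its state and suspends when it yields, and when control returns to it, it continues from the point where it was suspended.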

   Melvin Conway defines coroutines this way:

The execution of a coroutine is suspended as control leaves it, only to carry on where it left off when control re-enters the coroutine at some later stage.
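This "suspend, then carry on where it left off" behaviour can be observed with nothing but the Kotlin standard library, whose sequence builder is itself implemented with suspend functions. The sketch below is only a demonstration of the definition; the function name countdown and the log list are invented for it.

```kotlin
// Each yield() suspends the producer; it resumes right after the yield
// when the consumer asks for the next value.
fun countdown(from: Int, log: MutableList<String>): Sequence<Int> = sequence {
    var n = from
    while (n > 0) {
        log += "producer: yielding $n"
        yield(n)
        log += "producer: resumed after $n"
        n--
    }
}

fun main() {
    val log = mutableListOf<String>()
    for (value in countdown(2, log)) {
        log += "consumer: got $value"
    }
    log.forEach(::println)
    // producer: yielding 2
    // consumer: got 2
    // producer: resumed after 2
    // producer: yielding 1
    // consumer: got 1
    // producer: resumed after 1
}
```

The interleaving in the log shows control leaving the producer at yield and re-entering it at the very next statement, with the local variable n intact.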

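  This looks a bit like the yield keyword in C# and Python, and Python's yield-based generators are indeed a form of coroutine. In Kotlin, coroutines can also return results asynchronously, i.e. you can directly receive the data returned by an asynchronous task. Here launch and async are used to make the asynchronous calls. First we create a task list; think of it as a to-do list that states everything that has to be done: val taskList = mutableListOf<suspend () -> Picture>(). The suspend modifier marks a function that is able to suspend; such a function can only be called from a coroutine or from another suspend function. Next we add tasks to taskList, in this case the tasks that fetch image details. Note that each task must be wrapped in a pair of braces to mark it as a lambda expression rather than an immediate call to getDetail(), somewhat like Runnable(FetchApi::getDetail) in Java. Once all tasks have been added, every task in taskList is executed asynchronously. The job variable is an object of type Job: launch returns a Job, and the coroutine can be managed through it. For example, delay(2000); job.cancelAndJoin() cancels the job and suspends until it has finished, while job.join() in the code above suspends the current coroutine until job completes.

Getting a user's collection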
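For comparison, the same fan-out can also be written with coroutineScope, async and awaitAll, which return the results directly instead of appending to a shared list. This is an alternative sketch, not what the crawler does; it assumes kotlinx-coroutines-core is on the classpath, and runAll is a name invented for the example.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Runs every task concurrently and returns the results of the ones that succeeded,
// in the same order as the task list.
suspend fun <T : Any> runAll(tasks: List<suspend () -> T>): List<T> = coroutineScope {
    tasks.map { task ->
        async(Dispatchers.IO) {          // IO dispatcher: suited to blocking HTTP calls
            try {
                task()
            } catch (e: CancellationException) {
                throw e                  // never swallow cancellation
            } catch (e: Exception) {
                null                     // a failed task is simply dropped
            }
        }
    }.awaitAll().filterNotNull()
}

fun main() = runBlocking {
    val tasks = (1..5).map { n ->
        suspend {
            delay(100)                   // stands in for a network request
            if (n == 3) throw IllegalStateException("simulated failure")
            n * 10
        }
    }
    println(runAll(tasks)) // [10, 20, 40, 50]
}
```

Because every async is a child of coroutineScope, the function cannot return before all tasks have finished, so no explicit Job or join() is needed.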

suspend fun collection(nextLink: String, pageLimit: Int? = null): List<Picture> {
    var link: String? = nextLink
    var current = 0
    val tasks = mutableListOf<suspend () -> Picture>()
    // Walk the pages until next_url runs out or pageLimit pages have been read,
    // then fall through so the queued tasks actually get executed.
    while (pageLimit == null || current < pageLimit) {
        val url = link ?: break
        val json = requestOnToken(url).entity
        (json deserializeAsByTypeToken object : TypeToken<FavoriteResponse>() {}).apply {
            illusts?.forEach {
                println("add task for id $it")
                tasks += { getDetail(it.id.toString()) }
            }
            link = next_url
        }
        current++
    }
    val list = Collections.synchronizedList(mutableListOf<Picture>())
    // The counter is shared between coroutines, so it has to be atomic.
    val started = AtomicInteger(0)
    val job = GlobalScope.launch {
        for (task in tasks) async {
            println("getting picture ${started.incrementAndGet()}/${tasks.size}")
            try {
                list += task()
            } catch (e: Exception) {
                // skip pictures that fail to load
            }
        }
    }
    job.join()
    return list.apply { sort() }
}

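  Nothing much to add; the principle is the same, except that each page of JSON carries a next_url pointing to the next page, which can be used directly for paging.

Getting a user's uploads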
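Stripped of the Pixiv specifics, the next_url paging loop looks roughly like the sketch below. Page, collectPages and the fake three-page API are made up for illustration; in the crawler, fetch would be the request-and-deserialize step.

```kotlin
// One page of results plus the link to the next page (null on the last page).
data class Page<T>(val items: List<T>, val nextUrl: String?)

// Follows the next links until they run out or pageLimit pages have been read.
fun <T> collectPages(firstUrl: String, pageLimit: Int? = null, fetch: (String) -> Page<T>): List<T> {
    val items = mutableListOf<T>()
    var link: String? = firstUrl
    var pagesRead = 0
    while (pageLimit == null || pagesRead < pageLimit) {
        val url = link ?: break   // no next link: last page reached
        val page = fetch(url)
        items += page.items
        link = page.nextUrl
        pagesRead++
    }
    return items
}

fun main() {
    // A fake API with three pages, keyed by URL.
    val fake = mapOf(
        "p1" to Page(listOf(1, 2), "p2"),
        "p2" to Page(listOf(3), "p3"),
        "p3" to Page(listOf(4, 5), null)
    )
    println(collectPages("p1") { fake.getValue(it) })                // [1, 2, 3, 4, 5]
    println(collectPages("p1", pageLimit = 2) { fake.getValue(it) }) // [1, 2, 3]
}
```

Note that reaching the page limit ends the loop but still returns what has been collected so far.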

  • upload
@JvmStatic
suspend fun upload(user: String, pageLimit: Int): List<Picture> {
    val pics = Collections.synchronizedList(mutableListOf<Picture>())
    for (current in 1..pageLimit) {
        val link = combineIllustration(user, current)
        pics.addAll(deserializePage(link))
    }
    // Filtering returns a new list, so its result has to be used rather than discarded.
    return pics.filterNotNull().sorted()
}
  • deserializePage
    private suspend fun deserializePage(link: String): List<Picture> {
        val tasks = mutableListOf<suspend () -> Picture>()
        // This endpoint names its array "response"; rename it so IllustResponse can be reused.
        with(requestOnToken(link).entity.replace("\"response\":[{\"", "\"illusts\":[{\"")) {
            deserializeAs<IllustResponse>().illusts?.map { it.id }?.map(Long::toString)?.forEach {
                println("add task for id $it")
                tasks += {
                    api.FetchApi.Companion.getDetail(it)
                }
            }
        }
        return with(Collections.synchronizedList(mutableListOf<Picture>())) {
            val job = GlobalScope.launch {
                for (task in tasks) async {
                    try {
                        this@with += task()
                    } catch (e: Exception) {
                        // skip pictures that fail to load
                    }
                }
            }
            job.join()
            this
        }
    }

Downloading images

  • Core download function

    private fun download(picture: Picture, isShowcase: Boolean = false, showcaseTitle: String? = null) {
        val animatePath = "pixiv\\anime"
        val header = mapOf(
            "User-Agent" to "Mozilla/4.0 (compatible;MSIE 7.0; Windows 10; Chrome;)",
            "Accept-Encoding" to "gzip",
            "Referer" to "https://www.pixiv.net/member_illust.php?mode=medium&illust_id=${picture.response!![0].id}",
            "cookie" to "https://www.pixiv.net")
        picture.response?.get(0)?.apply {
            if (isIs_manga) {
                // Multi-page work: strip the first page's suffix, then rebuild it for every page.
                // This assumes the pages are named _p0.jpg, _p1.jpg, ...
                val baseUrl = image_urls?.large?.replace("_p0.jpg", "")
                for (i in 0 until page_count) {
                    val location = if (isShowcase) "pixiv\\特辑_$showcaseTitle\\atlas$id\\p$i.jpg" else "pixiv\\atlas$id\\p$i.jpg"
                    downloadFile("${baseUrl!!}_p$i.jpg", location, header)
                }
            } else {
                when (this.type) {
                    "illustration" -> downloadFile(
                        image_urls?.large!!,
                        if (isShowcase) "pixiv\\特辑_$showcaseTitle\\$id.jpg" else "pixiv\\$id.jpg",
                        header
                    )
                    "ugoira" -> {
                        // Animated work: download the frame archive, unzip it and assemble a GIF.
                        val tempFilePath = "$animatePath\\$id"
                        val bin = arrayOf("$animatePath\\$id.zip", "$animatePath\\$id.jpg", "$animatePath\\$id.png")
                        val link = image_urls?.large?.replace("img-original", "img-zip-ugoira")?.replace("_ugoira0.jpg", "_ugoira1920x1080.zip")
                        unzip(
                            Objects.requireNonNull(URL(link).createConnection(header)).inputStream,
                            Paths.get(tempFilePath)
                        )
                        File(tempFilePath).apply {
                            transGif(this.listFiles(), "$animatePath.gif")
                        }
                        // Remove the temporary files; delete() has to be called, not just referenced.
                        bin.map(::File).forEach { it.delete() }
                    }
                }
            }
        }
    }
  • Public interface

    @JvmStatic @JvmOverloads
    suspend fun download(picture: List<Picture>, isShowcase: Boolean = false, showcaseTitle: String? = null) {
        // One task per picture, all started concurrently.
        val tasks = mutableListOf<suspend () -> Unit>()
        for (pic in picture) {
            tasks += { download(pic, isShowcase, showcaseTitle) }
        }
        val job = GlobalScope.launch {
            for (task in tasks) async {
                task()
            }
        }
        job.join()
    }
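One detail of the clean-up step in the core download function is easy to get wrong: inside a lambda, a bare function reference such as File::delete is only an expression whose value is thrown away, so nothing is deleted. The sketch below shows the difference on temporary files; deleteNothing and deleteAll are names invented for the demonstration.

```kotlin
import java.io.File

// Looks like it deletes, but the lambda merely evaluates the reference File::delete
// and discards the resulting function object (the compiler warns about an unused expression).
fun deleteNothing(files: List<File>) {
    files.forEach { File::delete }
}

// Calling delete() on each element is what actually removes the files.
fun deleteAll(files: List<File>) {
    files.forEach { it.delete() }
}

fun main() {
    val files = List(2) { File.createTempFile("demo", ".tmp") }
    deleteNothing(files)
    println(files.all { it.exists() })  // true: the files are still there
    deleteAll(files)
    println(files.none { it.exists() }) // true: now they are gone
}
```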

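Parsing commands

  Once the API is done, the most important remaining piece is probably command parsing. There is no GUI, but for a command-line tool the importance of its commands goes without saying. First we define an interface for handling commands: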

interface CommandParser {
    fun parse(command: String): Command
}

  Then we define a Command class to hold a parsed command:

class Command {
    enum class Name {
        SEARCH,
        COLLECTION,
        UPLOAD,
        SHOWCASE;

        companion object {
            // Maps the first word of a command line to a Name; anything else is rejected.
            @JvmStatic
            fun parseName(value: String): Name {
                return when (value) {
                    "search" -> SEARCH
                    "collection" -> COLLECTION
                    "upload" -> UPLOAD
                    "showcase" -> SHOWCASE
                    else -> throw IllegalArgumentException()
                }
            }
        }
    }

    lateinit var name: Name
    lateinit var user: String
    var pageLimit = 0
    var minBookmark = 0
    var isDownload = false
    var tag: String? = null
    var skipTag: String? = null
}

  Then we create a concrete adapter class:

class Adapter : CommandParser {
    override fun parse(command: String): Command {
        val parsed = Command()
        val args = command.split(' ')
        parsed.name = Command.Name.parseName(args[0])
        // A flag's value is the argument right after it; a trailing flag without a value is ignored.
        for ((index, arg) in args.withIndex()) {
            val value = args.getOrNull(index + 1) ?: continue
            when (arg) {
                "-u" -> parsed.user = value
                "-d" -> parsed.isDownload = value.toBoolean()
                "-s" -> parsed.skipTag = value
                "-t" -> parsed.tag = value
                "-p" -> parsed.pageLimit = value.toInt()
                "-l" -> parsed.minBookmark = value.toInt()
            }
        }
        return parsed
    }

    companion object {
        @JvmStatic
        suspend fun invoke(command: Command): List<Picture>? {
            // Showcases are fetched, optionally downloaded, and returned unfiltered.
            if (command.name == Command.Name.SHOWCASE) {
                return with(ShowcaseApi.selectShowcase(ShowcaseApi.showcaseList(command.pageLimit))) {
                    if (command.isDownload) {
                        GlobalScope.launch {
                            this@with.keys.forEach {
                                FetchApi.download(this@with.getValue(it), true, it)
                            }
                        }
                    }
                    this.flatMap { t -> t.value }
                }
            }
            val result: List<Picture>? = when (command.name) {
                Command.Name.SEARCH -> command.tag?.let { FetchApi.search(it, command.pageLimit) }
                Command.Name.COLLECTION -> FetchApi.collection(combineCollection(command.user), command.pageLimit)
                Command.Name.UPLOAD -> FetchApi.upload(command.user, command.pageLimit)
                else -> return null
            }
            val tag = command.tag
            val skipTag = command.skipTag
            return result?.filter {
                // Keep pictures with more than minBookmark bookmarks (private + public).
                val count = it.response?.get(0)?.stats?.favorited_count
                (count?.favoritedCountPrivate ?: 0) + (count?.favoritedCountPublic ?: 0) > command.minBookmark
            }?.filter {
                tag == null || it.response?.get(0)?.tags?.contains(tag) == true
            }?.filter {
                skipTag == null || it.response?.get(0)?.tags?.contains(skipTag) != true
            }.apply {
                if (command.isDownload) {
                    GlobalScope.launch {
                        this@apply?.let { FetchApi.download(it, false, null) }
                    }
                }
            }
        }
    }
}

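  parse() turns the incoming command string into a Command object, while invoke() takes a Command object, fetches the images according to its parameters and returns the list of images. Next, in the Main class, i.e. the final entry class, we wrap the two in a single helper function: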

private suspend fun getAdapter(s: String): List<Picture>? = Adapter.invoke(Adapter().parse(s))
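Coming back to parse() for a moment: one possible way to make the flag scanning more robust is to consume each flag together with its value, so that a value that happens to look like a flag is never interpreted a second time. The sketch below is an assumption-level alternative, not the project's parser, and parseFlags is an invented name.

```kotlin
// Parses "name -flag value -flag value ..." into the command name and a flag-to-value map.
fun parseFlags(command: String): Pair<String, Map<String, String>> {
    val args = command.trim().split(Regex("\\s+"))
    val flags = mutableMapOf<String, String>()
    var i = 1
    while (i < args.size) {
        val arg = args[i]
        if (arg.startsWith("-")) {
            // requireNotNull throws IllegalArgumentException, which main already reports
            // as an unknown command.
            val value = requireNotNull(args.getOrNull(i + 1)) { "missing value for $arg" }
            flags[arg] = value
            i += 2 // skip the value that was just consumed
        } else {
            i++
        }
    }
    return args[0] to flags
}

fun main() {
    // The tag here is literally "-u"; it is read as the value of -t, not as a flag.
    println(parseFlags("search -t -u -p 2")) // (search, {-t=-u, -p=2})
}
```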

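  Then in main we only need to pass the command read by readLine() to getAdapter(), plus some simple exception handling. One thing to watch out for: requesting login too many times in a short period gets you banned by Pixiv for a while, so we also need a simple counter and a countdown lock. When the number of logins reaches a threshold the countdown lock engages; when the countdown ends, the counter is cleared and the lock is released. The code is as follows: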

private var loginCounter = 0
private var lock = false
private var lockedAt = 0L // time at which the countdown lock was engaged
private suspend fun getAdapter(s: String): List<Picture>? = Adapter.invoke(Adapter().parse(s))

suspend fun main() {
    while (true) {
        val command = readLine() ?: break // end of input
        if (command == "-q") break
        if (command.startsWith("-login")) {
            // Release the lock and clear the counter once the two-minute countdown is over.
            if (lock && System.currentTimeMillis() - lockedAt >= 120000L) {
                loginCounter = 0; lock = false
            }
            // Engage the lock after three logins.
            if (!lock && loginCounter == 3) {
                lock = true
                lockedAt = System.currentTimeMillis()
            }
            if (lock) {
                println("you have logged in too many times, please wait a moment")
                continue
            }
            println("logging in")
            if (command == "-login") {
                // No credentials given: reuse the ones saved in config.json.
                val config = Config.deserialize("config.json")
                if ((System.currentTimeMillis() / 1000) - config.date >= 86400) {
                    println("config has expired, please use -login [name] [password] to refresh")
                }
                getToken(config.name, config.password)
            } else {
                val parts = command.split(' ')
                getToken(parts[1], parts[2])
            }
            if (Authorization.token != null) println("successful login with token ${Authorization.token} (please blur it out in screenshots)")
            else println("login failed, check whether the account/password is wrong or your ip/account has been banned for sending requests too frequently (mostly login requests)")
            loginCounter++
            continue
        }
        try {
            getAdapter(command)?.forEach(FetchApi.printPicture)
        } catch (e: IllegalArgumentException) {
            println("unknown command")
        }
    }
}
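The counter and countdown lock can also be pulled out into a small class of their own, which makes the behaviour easy to check with a fake clock. The following is a sketch under that idea only; LoginLimiter, tryAcquire and the default values (3 attempts, 120 seconds, taken from the numbers used in main) are illustrative, not part of the project.

```kotlin
// Allows at most maxAttempts logins, then refuses until cooldownMillis have passed.
class LoginLimiter(
    private val maxAttempts: Int = 3,
    private val cooldownMillis: Long = 120_000L,
    private val clock: () -> Long = System::currentTimeMillis
) {
    private var attempts = 0
    private var lockedAt: Long? = null

    // Returns true (and counts the attempt) if a login may be made right now.
    fun tryAcquire(): Boolean {
        val now = clock()
        val locked = lockedAt
        if (locked != null) {
            if (now - locked < cooldownMillis) return false
            lockedAt = null // countdown finished: release the lock and clear the counter
            attempts = 0
        }
        if (attempts >= maxAttempts) {
            lockedAt = now  // threshold reached: start the countdown
            return false
        }
        attempts++
        return true
    }
}

fun main() {
    var now = 0L
    val limiter = LoginLimiter(clock = { now }) // fake clock controlled by the test
    println((1..4).map { limiter.tryAcquire() }) // [true, true, true, false]
    now += 120_000L
    println(limiter.tryAcquire())                // true
}
```

Injecting the clock is a design choice made here so that the two-minute wait does not have to be sat through in a test.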
