2021-10-08

Scraping stock prices with Python

Python Scraping

Last time I got as far as scraping a web page, so this time I try scraping stock prices.

freelancer.hatenablog.jp

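Sites where stock prices can be obtained

Looking into sites that permit scraping, I found the following: Google Finance, Kabutan (株探), and Minkabu (みんかぶ).

Of these, I chose Kabutan because its page source looked the easiest to parse.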

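Kabutan (株探) page information

URL

https://kabutan.jp/stock/?code=<stock code>

Stock price information for Takeda Pharmaceutical (as of October 7, 2021)

kabutan.jp

Takeda Pharmaceutical is a stock I currently hold.

Source (excerpt)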

<div id="kobetsu_left">

<dl>
  <dt>前日終値</dt>
  <dd class="floatr">3,340.0&nbsp;(<time datetime="2021-10-06">10/06</time>)</dd>
</dl>


<h2><time datetime="2021-10-07">10月07日</time></h2>

<table>
  <tbody>
    <tr>
      <th scope="row">始値</th>
      <td>3,200.0</td>
      <td class="mark">&nbsp;</td>
      <td>(<time datetime="2021-10-07T09:03+09:00">09:03</time>)</td>
    </tr>
    <tr>
      <th scope="row">高値</th>
      <td>3,245.0</td>
            <td class="mark">&nbsp;</td>
            <td>(<time datetime="2021-10-07T14:52+09:00">14:52</time>)</td>
    </tr>
    <tr>
      <th scope="row">安値</th>
      <td>3,157.0</td>
            <td class="mark">&nbsp;</td>
            <td>(<time datetime="2021-10-07T09:06+09:00">09:06</time>)</td>
    </tr>
    <tr>
            <th scope="row">終値</th>
                  <td>3,221.0</td>
            <td class="mark">&nbsp;</td>
      <td>(<time datetime="2021-10-07T15:00+09:00">15:00</time>)</td>
    </tr>
  </tbody>
</table>
・
・
・
</div>

The items I want are roughly the open (始値), high (高値), low (安値), and close (終値).

Parsing Kabutan

Sample program

import requests
from bs4 import BeautifulSoup

res = requests.get("https://kabutan.jp/stock/?code=4502")
doc = BeautifulSoup(res.text, "html.parser")
# select() always returns a list (empty when nothing matches), never None
el = doc.select("#kobetsu_left tr")
for row in el:
    print(row.select("td"))

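Passing the CSS selector ("#kobetsu_left tr") to the select method retrieves the matching elements.
For each of those elements, select is called again with the CSS selector ("td"), and the elements it returns are printed.

Output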

[<td>3,200.0</td>, <td class="mark"> </td>, <td>(<time datetime="2021-10-07T09:03+09:00">09:03</time>)</td>]
[<td>3,245.0</td>, <td class="mark"> </td>, <td>(<time datetime="2021-10-07T14:52+09:00">14:52</time>)</td>]
[<td>3,157.0</td>, <td class="mark"> </td>, <td>(<time datetime="2021-10-07T09:06+09:00">09:06</time>)</td>]
[<td>3,221.0</td>, <td class="mark"> </td>, <td>(<time datetime="2021-10-07T15:00+09:00">15:00</time>)</td>]
・
・
・
・

The elements containing the prices were retrieved.
If I can pull just the price out of this result, the goal is met.

Sample program

import requests
from bs4 import BeautifulSoup

res = requests.get("https://kabutan.jp/stock/?code=4502")
doc = BeautifulSoup(res.text, "html.parser")
el = doc.select("#kobetsu_left tr")
# select() returns a list (never None), so check that the four price rows exist
if len(el) >= 4:
    print("始値" + el[0].select("td")[0].get_text())  # open
    print("高値" + el[1].select("td")[0].get_text())  # high
    print("安値" + el[2].select("td")[0].get_text())  # low
    print("終値" + el[3].select("td")[0].get_text())  # close

Output

始値3,200.0
高値3,245.0
安値3,157.0
終値3,221.0

The data I needed was retrieved successfully.
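
Looking rows up by index depends on them always appearing in the same order. As a minimal sketch of an alternative, the block below keys each price by the text of its th cell and converts strings such as "3,200.0" to floats. It runs against a trimmed copy of the markup quoted earlier, not the live site, and the names parse_prices and SAMPLE_HTML are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted earlier in this post (not fetched live).
SAMPLE_HTML = """
<div id="kobetsu_left">
<table><tbody>
<tr><th scope="row">始値</th><td>3,200.0</td><td class="mark">&nbsp;</td><td>(09:03)</td></tr>
<tr><th scope="row">高値</th><td>3,245.0</td><td class="mark">&nbsp;</td><td>(14:52)</td></tr>
<tr><th scope="row">安値</th><td>3,157.0</td><td class="mark">&nbsp;</td><td>(09:06)</td></tr>
<tr><th scope="row">終値</th><td>3,221.0</td><td class="mark">&nbsp;</td><td>(15:00)</td></tr>
</tbody></table>
</div>
"""


def parse_prices(html):
    """Return {row label: price as float} for the rows under #kobetsu_left."""
    doc = BeautifulSoup(html, "html.parser")
    prices = {}
    for row in doc.select("#kobetsu_left tr"):
        label = row.find("th")
        cell = row.find("td")  # first td in the row holds the price
        if label is None or cell is None:
            continue
        text = cell.get_text(strip=True).replace(",", "")
        try:
            prices[label.get_text(strip=True)] = float(text)
        except ValueError:
            continue  # skip rows whose first cell is not a number
    return prices


prices = parse_prices(SAMPLE_HTML)
print(prices)
```

Keying by label means a reordered or newly inserted row does not silently shift the values, at the cost of depending on the label text staying the same.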
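If I build an array of the stock codes I hold and swap the stock-code part of the URL dynamically, I can get the prices of everything I hold.
Even though scraping is permitted, heavy access puts load on the site, so requests should probably be limited to about one every few seconds.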
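
The loop over stock codes with a pause between requests could look roughly like the sketch below. The function name fetch_all, the three-second default, and the second stock code are assumptions for illustration. The download step is passed in as a function so the pacing logic can be tried without touching the site.

```python
import time

BASE_URL = "https://kabutan.jp/stock/?code={code}"


def fetch_all(codes, fetch, delay_seconds=3.0, sleep=time.sleep):
    """Fetch one page per stock code, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns the page text.
    """
    results = {}
    for i, code in enumerate(codes):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[code] = fetch(BASE_URL.format(code=code))
    return results


# Dry run with stand-ins: no network access and no real waiting.
# "9999" is only a placeholder code for the demonstration.
pauses = []
pages = fetch_all(
    ["4502", "9999"],
    fetch=lambda url: "<html>" + url + "</html>",
    sleep=pauses.append,
)
print(list(pages), pauses)
```

In real use, fetch would wrap requests.get and return res.text, and the default sleep would be left in place.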

2021-10-06

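Investigating how to scrape with Python to tally the prices of my stock holdings automatically

Python Scraping

I want to fetch the prices of the stocks I hold automatically and store them in a database, so I looked into how to scrape with Python.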

The requests module

The minimum requirement is the requests module.

Install it with the pip command.

pip3 install requests

Page to scrape

freelancer.hatenablog.jp

The sample page I built in the article linked above is the scraping target.

http://sample.kansai-fan.com/py/pymysql.py

Sample program

scrape_html.py

import requests

res = requests.get("http://sample.kansai-fan.com/py/pymysql.py")
print(res.text)

The get function of the requests module fetches the response for the given URL.
Printing the response's text property outputs the HTML.

Output

<html><meta charset='utf-8'><body>
<h1>PyMySQLサンプル</h1>
件数2
</body></html>
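
The sample above prints whatever comes back, including error pages. A slightly more defensive variant might set a timeout and raise on HTTP error status codes, as in the sketch below. The helper name fetch_html and the 10-second default are assumptions; timeout and raise_for_status() are part of requests.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the body of url as text, raising on HTTP errors.

    get defaults to requests.get and is a parameter only so the function
    can be exercised without network access.
    """
    res = get(url, timeout=timeout)  # give up if the server does not respond in time
    res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return res.text


# Example call (needs network access, so it is left commented out):
# print(fetch_html("http://sample.kansai-fan.com/py/pymysql.py"))
```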

The BeautifulSoup module

The next step is extracting specific data from the HTML.
This is easy to implement with the BeautifulSoup module.

Install it with the pip command as well.

pip3 install beautifulsoup4

Sample program

scrape_html.py

import requests
from bs4 import BeautifulSoup

res = requests.get("http://sample.kansai-fan.com/py/pymysql.py")
doc = BeautifulSoup(res.text, "html.parser")
el = doc.find("h1")
if el is not None:
    print(el.get_text())

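Passing the fetched HTML and a parser name to the constructor is all it takes to parse the page.
After that, the find method retrieves an element and get_text() returns its content for display.

Output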

PyMySQLサンプル

The content of the <h1> was printed.
Given a tag name, the find method returns only the first matching element.
To get multiple elements, use the find_all method.
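
The difference between find and find_all can be checked on a small inline document. The HTML below is invented for this illustration and is not the sample page.

```python
from bs4 import BeautifulSoup

# Invented markup with two <h1> elements and no <h2>.
html = """
<html><body>
<h1>First heading</h1>
<p>one</p>
<h1>Second heading</h1>
<p>two</p>
</body></html>
"""

doc = BeautifulSoup(html, "html.parser")

first = doc.find("h1")  # only the first match
all_h1 = [el.get_text() for el in doc.find_all("h1")]  # every match, in document order

missing_one = doc.find("h2")  # None when nothing matches
missing_all = doc.find_all("h2")  # empty list when nothing matches

print(first.get_text())
print(all_h1)
print(missing_one, list(missing_all))
```

Because find returns None when there is no match while find_all returns an empty list, the "is not None" check belongs with find only.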

That completes the preparation for scraping.
Next time I plan to try fetching stock prices.

2021-10-03

Connected to MySQL from Python on StarServer!

StarServer Python CGI Bottle mysql-connector-python PyMySQL MySQL

Last time I looked into the MySQL libraries that can be used from Python.

freelancer.hatenablog.jp

To give the conclusion first: I was able to connect to MySQL and fetch data using both mysql-connector-python and PyMySQL.

Downloading and placing the libraries

mysql-connector-python

Download mysql-connector-python-8.0.26.tar.gz from the link below.

pypi.org

Extract it and upload the mysql directory to the server.

[Image: mysql-connector-python]

PyMySQL

Download PyMySQL-1.0.2.tar.gz from the link below.

pypi.org

Extract it and upload the pymysql directory to the server.

[Image: PyMySQL]

After uploading

[Image: directory layout after uploading]
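
Uploaded package directories can be imported without pip because, when a script is run directly, Python puts that script's directory at the front of sys.path. If the packages were kept in a separate subdirectory instead, that directory would have to be added to sys.path by hand. The sketch below shows that mechanism with a throwaway package in a temporary directory. The names vendor and demo_uploaded_pkg are hypothetical and are not part of the setup described here.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny package on disk, standing in for an uploaded library.
    vendor = os.path.join(root, "vendor")
    pkg_dir = os.path.join(vendor, "demo_uploaded_pkg")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w", encoding="utf-8") as f:
        f.write("VERSION = '0.0.1'\n")

    # Make the directory that contains the package visible to the import system.
    sys.path.insert(0, vendor)
    try:
        importlib.invalidate_caches()
        mod = importlib.import_module("demo_uploaded_pkg")
        loaded_version = mod.VERSION
        loaded_from = os.path.basename(os.path.dirname(mod.__file__))
    finally:
        sys.path.remove(vendor)

print(loaded_version, loaded_from)
```

This only works for libraries that are pure Python. A package that needs compiled extensions cannot be installed by copying its source directory.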

Table used for testing

[Image: table used for testing]

Sample program (mysql-connector-python)

mysqlconnector.py

#!/usr/bin/python3.6
import mysql.connector as mydb
import sys, io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding = 'utf-8')

connection = mydb.connect(host='aaa.bbb.ccc',
                          port='3306',
                          user='dbuser',
                          password='dbpassword',
                          database='db_sample',
                          charset='utf8'
)
cursor = connection.cursor()
cursor.execute("select count(*) as cnt from sample1")

cnt = 0
row = cursor.fetchone()
if row is not None:
     cnt = row[0]

cursor.close()
connection.close()

print("Content-Type: text/html; charset=utf-8\n")
print("<html><meta charset='utf-8'><body>")
print("<h1>mysql-connector-pythonサンプル</h1>")
print("件数" + str(cnt))
print("</body></html>")

Sample program (PyMySQL)

pymysql.py

#!/usr/bin/python3.6
import pymysql
import pymysql.cursors

import sys, io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding = 'utf-8')

# Create the connection
connection = pymysql.connect(host='aaa.bbb.ccc',
                             user='dbuser',
                             port=3306,
                             password='dbpassword',
                             db='db_sample',
                             charset='utf8',
                             cursorclass=pymysql.cursors.DictCursor)
     
cursor = connection.cursor()
cursor.execute("select count(*) as cnt from sample1")

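cnt = 0
row = cursor.fetchone()
if row is not None:
     cnt = row["cnt"]

# Release the cursor and connection once the result has been read
cursor.close()
connection.close()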

print("Content-Type: text/html; charset=utf-8\n")
print("<html><meta charset='utf-8'><body>")
print("<h1>PyMySQLサンプル</h1>")
print("件数" + str(cnt))
print("</body></html>")

The connection settings are almost the same as for mysql-connector-python.
The differences are that the port is given as a number, and that setting cursorclass to pymysql.cursors.DictCursor makes results come back as dictionaries.

Output

[Image: output of mysqlconnector.py]

[Image: output of pymysql.py]

The count was output correctly.
With this, the environment for building web apps with Python and MySQL on StarServer is ready.
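
The two samples read the count differently: row[0] with a tuple row and row["cnt"] with a dictionary row. That difference can be tried locally without a MySQL server by using sqlite3 from the standard library as a stand-in. This is not PyMySQL or mysql-connector-python; sqlite3.Row only plays a role similar to DictCursor here, and the columns of sample1 are assumed because the real table is shown only as an image.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema, for illustration only.
conn.execute("create table sample1 (id integer primary key, name text)")
conn.executemany("insert into sample1 (name) values (?)", [("a",), ("b",)])

# Default rows are tuples, so columns are read by position.
row = conn.execute("select count(*) as cnt from sample1").fetchone()
tuple_cnt = row[0]

# With sqlite3.Row as the row factory, columns can also be read by name.
conn.row_factory = sqlite3.Row
row = conn.execute("select count(*) as cnt from sample1").fetchone()
named_cnt = row["cnt"]

conn.close()
print(tuple_cnt, named_cnt)
```

Reading by name keeps working if another column is later added to the select list, which is the practical benefit of the dictionary-style cursor.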

2021-09-25

Is it impossible to connect to MySQL from Python on StarServer?

StarServer Python CGI Bottle

Last time I tried Bottle's routing feature.

freelancer.hatenablog.jp

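All that is left is fetching data from MySQL, and then I should be able to build a reasonably capable app.

From what I could find, Python's standard library does not seem to include a library for connecting to MySQL.

Searching for MySQL libraries that can be used from Python turned up the following.

MySQL libraries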

MySQLdb
mysql-connector-python
PyMySQL

These are normally installed with pip, so on StarServer, where pip does not seem to be available, they cannot be installed that way.

pypi.org

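The source of each library can be downloaded here, so wouldn't it work if I downloaded the source and uploaded it over FTP?

With that in mind, I decided to test it.

Next time I plan to check whether I can connect to MySQL on StarServer using MySQLdb.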

2021-09-22

Trying Bottle's routing feature

StarServer Python CGI Bottle

I was able to display a page that uses Bottle on StarServer.

freelancer.hatenablog.jp

This time I tried dynamic routing.

Sample program

index.py

#!/usr/bin/python3.6

from bottle import route, run, template

@route('/')
def index():
    return template("sample", name="テスト")

# ※ see the note below this listing
@route('/<name>')
def test(name):
    return template("sample", name=name)

run(server='cgi')

※ Accessing /xxx under the root directory runs test(name), and the name argument is set to xxx.
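
To make the idea of a dynamic route concrete, the sketch below turns a pattern such as '/<name>' into a regular expression and extracts the placeholder value from a path. It is a rough illustration in plain Python and not how Bottle is actually implemented. The function names are made up, and it assumes the fixed parts of the pattern contain no regex metacharacters.

```python
import re


def compile_route(pattern):
    """Turn '/<name>' into a regex with one named group per placeholder."""
    regex = re.sub(r"<(\w+)>", r"(?P<\1>[^/]+)", pattern)
    return re.compile("^" + regex + "$")


def match_route(pattern, path):
    """Return the placeholder values as a dict, or None if path does not match."""
    m = compile_route(pattern).match(path)
    return m.groupdict() if m else None


print(match_route("/<name>", "/xxx"))
print(match_route("/<name>", "/"))
print(match_route("/<name>", "/a/b"))
```

In this sketch a placeholder matches a single path segment, so "/" and "/a/b" do not match '/<name>'.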

sample.html

<html lang="ja">
    <head>
        <title>サンプル</title>
    </head>
    <body>
        {{ name }}
    </body>
</html>

With this, accessing /xxx should display xxx on the screen, but...

[Image: 404 error]

It returned a 404 error.

It seems index.py is not being reached.

Changing the URL as follows made the page display.

/index.py/xxx

[Image: /index.py/xxx]

Japanese works too.

/index.py/テスト

[Image: /index.py/テスト]

If possible, though, I want the page to display properly at document root + "/xxx".
Adding the following to .htaccess rewrites the URL.

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization}]
RewriteBase /
RewriteRule ^index\.py$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.py/%{REQUEST_URI} [L]
</IfModule>

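When /xxx is accessed, the URL shown in the browser does not change, but behind the scenes the request goes to /index.py/xxx.

After updating .htaccess and accessing /xxx,

the page was displayed.

Using index.py as a controller should make dynamic pages easy to implement.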
