What is this?

Mysql query will become slow when table contains millions of rows. A simple way of solving this probolem is spliting the big table into small ones. We use some method to decide which table each row should be moved into. A widely used method is "modulus", e.g. group rows by [ID mod 100]. This tool will help you to do this job easily.

Features

Move records from one table to another(or other) table(s) according to user specified rules.
Dynamically create new tables during moving.
Move records between different mysql servers.
Check data integrity of the new tables.
For records who have been moved to new tables, remove them from source table.
Using it as a python module which means dynamically changing the behavior.

Install

python2: pip install pymysql_split_tool

python3: pip3 install pymysql_split_tool

The following python version are supported and tested: 2.6, 2.7, 3.3, 3.4, 3.5, 3.6, 3.7

Cli Usage

Just create a split task file and run this tool with it:

python -m pymysql_split_tool --action split --task ./task.json

Supported parameters:

parameter name	description
action	what type of task to do. required.
task	path to a file contains the detailed infomation for the task. required.
debug	print debug log. optional.

There are two type of action supported:

action	description
split	Move rows from a big table into small tables.
check	Check data integrity of new small tables. ==not implemented yet==.
remove	Remove rows from a big table after checking data integrity in small tables. ==not implemented yet==.

A quick look at a sample task file:

{
  "src":{
    "mysql":{
      "host":"127.0.0.1",
      "port":"3306",
      "unix_socket":"/path/to/mysqld.sock",
      "user":"user_name",
      "password":"password"
    },
    "database":"name_of_database",
    "table":"name_of_the_big_table"
  },
  "dest":{
    "table":"small_table_[n]"
  },
  "rule":{
    "filter":"id>1000",
    "group_method":"modulus",
    "group_base":100,
    "group_column":"id"
  }
}

Samples

Check the test directory for more sample task files.

Supported fields in `task file`:

field @ level-1	field @ level-2	field @ level-3	description	sample
src			infomation of the being splitted table
	mysql		mysql conneciton parameters for PyMysql
		host	mysql host(domain name or ip)	"127.0.0.1"
		port	mysql port	3306
		unix_socket	connect mysql server with unix-socket	"/var/run/mysqld.sock"
		user	mysql user name, need read privilege at least	"root"
		password	mysql user password
	database		name of database
	table		name of table
dest			infomation of the new created small tables
	mysql		mysql conneciton parameters. If not set, will use the same parameters as "src". Has same level-3 fileds like "src"
	database		name of database where we create new small tables. If not set, will use the same database name as "src".
	table		name pattern of table. [n] will be replaced by an integer.	"small_table_[n]"
	create_table_first		if set to 1, it will create the small table if not exists. default is 1. ==0 not supported yet==.	0
	create_table_sql		specify a sql to create new table. If not set, will use 'show create table' to get a creation sql. ==can not specify yet==.	"create table [table_name] if not exists..."
rule			defines how it do the work
	filter		a sql 'where' clause telling it which rows should be moved	"id>1000"
	page_size		page size when select in paging mode	1000
	page_sleep		sleep a few seconds between pages to avoid mysql server busy working	1
	order_by		select order in paging mode	"id asc"
	group_method		split rows by [modulus] or [devide], or [all]. each row will map to an integer.	"modulus"
	group_base		[modulus] or [devide] by which number	100
	group_column		column name which the method will used on.
	group_int		an int array. if set, this tool will move data whose method result fall in this array. for method [modulus] and [devide] only. pattern: [a,b,[from,to],c,d]	[1,7,[12,15],20]
check			for "check" action only. how we check data integrity.
	count		set to 1 to use mysql count() function to compare src and dest..	1
	sum		a column name list on which mysql sum() function is used.	["id","created"]

API Usage

import json
import pymysql_split_tool

task_file = open('path/to/task.json')
task = json.loads(task_file.read())
task_file.close()

#init a split task with a specified task object whose type is dict
pymysql_split_tool.init("split", task)
#run the task
pymysql_split_tool.do_work()

Check here for a sample with monkey patch.

How it works internally

For action split, there are 4 different work flows:

A: if dest.mysql is set, rule.group_int not set, it works in the following steps:
1. make connection to "src" and "dest" mysql server.
2. get data with this sql: select * from src.database.src.table where rule.filter order by rule.order_by limit offset,rule.page_size
3. for each selected data, use rule.group_method on the rule.group_column, get a method result: n
4. ensure table dest.table_[n] existance if dest.create_table_first exists.
5. insert data into new table
B: if dest.mysql is set, rule.group_int is set, it works like this:
1. make connection to "src" and "dest" mysql server.
2. generate an integer array according to rule.group_int.
3. for each integer n, ensure table dest.table_[n] existance if dest.create_table_first exists.
4. for each integer n, get data with this sql: select * from src.database.src.table where rule.filter and rule.group_column method = n order by rule.order_by limit offset,rule.page_size
5. insert data into new table
C: if dest.mysql is not set, rule.group_int not set, it works like this:
1. make connection to "src" mysql server
2. get group list with this sql: select n from src.database.src.table group by rule.group_column method result
3. for each group n, copy data with this sql: replace into dest.database.dest.table select * from src.database.src.table where rule.filter where rule.group_column method result = n
D: if dest.mysql is not set, rule.group_int is set, it works like this:
1. make connection to "src" mysql server
2. copy data with this sql: replace into dest.database.dest.table select * from src.database.src.table where rule.filter

For action check, here is the work flow:

make connection to mysql server.
if rule.group_int is not set, use this sql to get a group integer list: select n from src.database.src.table group by rule.group_column method result
for each integer in group list, select count() and sum() in both src and dest tables. do the comparation.

For action remove, here is the work flow:

make connection to mysql server.
if rule.group_int is not set, use this sql to get a group integer list.
for each integer in group list, delete records in src table.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
pymysql_split_tool		pymysql_split_tool
test		test
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
README.rst		README.rst
README_cn.md		README_cn.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What is this?

Features

Install

Cli Usage

Samples

Supported fields in `task file`:

API Usage

How it works internally

About

Releases

Packages

Languages

License

echoma/pymysql_split_tool

Folders and files

Latest commit

History

Repository files navigation

What is this?

Features

Install

Cli Usage

Samples

Supported fields in task file:

API Usage

How it works internally

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Supported fields in `task file`:

Packages