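Audio warning! My bedside girlfriend reads me papers every night to lull me to sleep
Deep learning papers are always a bit hard to read. What can you do when you just can't get through one?
Let PaddlePaddle read it for me ︿( ̄︶ ̄)︿
How can PaddlePaddle "read" a paper by itself, that is, perform text-to-speech (TTS)? Broken down simply, it is enough to implement TTS for the following three scenarios:
an introduction to a paper on an HTML page
the abstract of a paper in a PDF file
English sentences in an image, recognized by OCR
These three scenarios rely on the following two PaddlePaddle development kits: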
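1. The PaddlePaddle Parakeet development kit performs the text-to-speech step, with two vocoders, WaveFlow and Griffin-Lim, used in turn to synthesize the waveform. WaveFlow is a vocoder based on a deep neural network. Griffin-Lim is an algorithm that reconstructs speech when only the magnitude spectrum is known and the phase spectrum is not; it is a classic vocoder, simple and efficient. Readers can compare the synthesis quality of the two in the final TTS audio.
Parakeet (project page:
https://github.com/PaddlePaddle/Parakeet)
The PaddlePaddle speech synthesis kit provides flexible, efficient, state-of-the-art text-to-speech tools that help developers build and apply speech synthesis models more easily.
Prerequisite project: Parakeet: A Hands-On Guide to Training Speech Synthesis Models (script task, Notebook)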
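2. The PaddlePaddle PaddleOCR development kit turns the text in images into readable text. Papers contain figures, and the words in a figure must first be converted to text before they can be "read" aloud; an OCR model does this. Because text-to-speech pronounces one word at a time, the OCR model has to recognize not only characters but also word boundaries. This project therefore uses the PaddleOCR pretrained model that can recognize spaces.
PaddleOCR (project page:
https://github.com/PaddlePaddle/PaddleOCR)
The PaddlePaddle text recognition kit aims to be a rich, leading, and practical library of text detection and recognition models and tools. It open-sources an ultra-lightweight Chinese OCR model and a general-purpose Chinese OCR model, and provides dozens of training methods for text detection and recognition models, helping users train better models and put them into production.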
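To make the Griffin-Lim idea mentioned above more concrete, here is a minimal NumPy sketch of the iteration: start from a random phase, go back and forth between the time domain and the spectrogram, and keep only the newly estimated phase each time. It is not Parakeet's implementation. It works on a plain linear magnitude spectrogram rather than the mel spectrogram that FastSpeech produces, and the frame size, hop size, and test tone are toy values assumed for illustration.

```python
import numpy as np

N_FFT, HOP = 256, 64            # toy frame and hop sizes (assumptions)
WINDOW = np.hanning(N_FFT)

def stft(signal):
    """Short-time Fourier transform: one windowed FFT per hop."""
    starts = range(0, len(signal) - N_FFT + 1, HOP)
    return np.array([np.fft.rfft(WINDOW * signal[s:s + N_FFT]) for s in starts])

def istft(spec):
    """Inverse STFT by windowed overlap-add, normalized by the summed window energy."""
    length = N_FFT + HOP * (len(spec) - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    for k, frame in enumerate(spec):
        s = k * HOP
        signal[s:s + N_FFT] += WINDOW * np.fft.irfft(frame, n=N_FFT)
        norm[s:s + N_FFT] += WINDOW ** 2
    return signal / np.maximum(norm, 1e-8)

def magnitude_error(signal, magnitude):
    """Relative distance between a signal's STFT magnitude and the target magnitude."""
    return np.linalg.norm(np.abs(stft(signal)) - magnitude) / np.linalg.norm(magnitude)

def griffin_lim(magnitude, n_iter=30, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))   # random initial phase
    for _ in range(n_iter):
        signal = istft(magnitude * phase)            # impose the known magnitude
        phase = np.exp(1j * np.angle(stft(signal)))  # keep only the re-estimated phase
    return istft(magnitude * phase)

# Toy target: magnitude spectrogram of a 440 Hz tone sampled at 22.05 kHz
t = np.arange(4096) / 22050.0
target = np.abs(stft(np.sin(2 * np.pi * 440.0 * t)))
err_random = magnitude_error(griffin_lim(target, n_iter=0), target)
err_refined = magnitude_error(griffin_lim(target, n_iter=30), target)
print("random phase: %.3f, after 30 iterations: %.3f" % (err_random, err_refined))
```

The printed error should drop after the iterations, which is why the method needs no training data. A neural vocoder such as WaveFlow instead learns how to generate the waveform.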
Read-aloud result for the HTML article paragraphs:
----------------------------
Audio synthesis has a variety of applications, including text-to-speech (TTS),
music generation, virtual assistant, and digital content creation.
In recent years, deep neural network has obtained noticeable successes for
synthesizing raw audio in high-fidelity speech and music generation.
One of the most successful examples are autoregressive models (e.g., WaveNet).
However, they sequentially generate high temporal resolution of raw waveform (e.g., 24 kHz) at synthesis,
which are prohibitively slow for real-time applications.
Many researchers from various organizations have spent considerable effort to develop parallel generative models for raw audio.
Parallel WaveNet and ClariNet could generate high-fidelity audio in parallel,
but they require distillation from a pretrained autoregressive model and a set of auxiliary losses for training,
which complicates the training pipeline and increases the cost of development.
GAN-based model can be trained from scratch, but it provides inferior audio fidelity than WaveNet.
WaveGlow can be trained directly with maximum likelihood,
but the model has huge number of parameters (e.g., 88M parameters) to reach the comparable fidelity of audio as WaveNet.
Today, we’re excited to announce WaveFlow (paper, audio samples), the latest milestone of audio synthesis research at Baidu.
It features: 1) high-fidelity & ultra-fast audio synthesis, 2) simple likelihood-based training,
and 3) small memory footprint, which could not be achieved simultaneously in previous work.
Our small-footprint model (5.91M parameters) can synthesize high-fidelity speech (MOS: 4.32)
more than 40x faster than real-time on a Nvidia V100 GPU.
WaveFlow also provides a unified view of likelihood-models for raw audio,
which includes both WaveNet and WaveGlow as special cases and allow us to explicitly trade inference parallelism for model capacity.
Our paper will be presented at ICML 2020.
For more details of WaveFlow, please check out our paper: https://arxiv.org/abs/1912.01219
Audio samples are in: https://waveflow-demo.github.io/
The implementation can be accessed in Parakeet, which is a text-to-speech toolkit building on PaddlePaddle:
https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow
----------------------------
Read-aloud result for the PDF abstract; the passage read is:
----------------------------
Abstract
In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which
is directly trained with maximum likelihood.
It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture,
while modeling the local variations using expressive autoregressive functions.
WaveFlow provides a unified view of likelihood-based models for 1-D data,
including WaveNet and WaveGlow as special cases.
It generates high-fidelity speech as WaveNet,
while synthesizing several orders of magnitude faster as
it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps.
Furthermore, it can significantly reduce the likelihood gap that has existed
between autoregressive models and flow-based models for efficient synthesis.
Finally, our small-footprint WaveFlow has only 5.91M parameters,
which is 15× smaller than WaveGlow.
It can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time
(at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.
----------------------------
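Read-aloud result for the OCR image text:
The whole procedure below is published on AI Studio and can be tried online. Readers can of course also follow along and run it on their own machines:
https://aistudio.baidu.com/aistudio/projectdetail/676162
Step 1: Download and install the libraries
Install the Parakeet model library
Note: if importing Parakeet fails after installation, restart the project and the import will work.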
!git clone https://github.com/PaddlePaddle/Parakeet
# %cd (not !cd) so that the directory change persists across notebook cells
%cd Parakeet
!pip install -e .
%cd ..
import nltk
nltk.download("punkt")
nltk.download("cmudict")
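Prepare the Parakeet pretrained models
The pretrained models needed are:
the WaveFlow pretrained model with 128 residual channels (waveflow_res128_ljspeech_ckpt_1.0)
the FastSpeech text-to-speech pretrained model (fastspeech_ljspeech_ckpt_1.0)
Install PaddleOCR
Prepare the pretrained recognition model that supports spaces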
%cd ../..
Install Beautiful Soup and the other utility libraries
!pip install bs4
!pip install xlwt
!pip install xlrd
!pip install lxml
!pip install w3lib
!pip install pdfminer3k
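Step 2: Parse the article content
The three typical scenarios (an HTML web article, an ordinary PDF, and text in an image) are parsed as follows.
Parsing the HTML article:
Here the requests module and the Beautiful Soup library crawl and clean the Baidu Research page that introduces WaveFlow, "WaveFlow: A Compact Flow-Based Model for Raw Audio".
Beautiful Soup is a Python library for extracting data from HTML or XML files. It offers simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolbox that parses a document and hands the user the data to be scraped. Because it is simple, a complete application takes very little code.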
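The crawler in this section keeps only the `<span>` elements whose `style` attribute matches one exact string. To show what that filter does without any network access, here is a dependency-free sketch that uses the standard library's `html.parser` in place of Beautiful Soup. The class name `SpanTextExtractor` and the sample HTML are made up for illustration, and unlike Beautiful Soup's `find_all` the sketch does not report matching spans nested inside another matching span separately.

```python
from html.parser import HTMLParser

class SpanTextExtractor(HTMLParser):
    """Collect the text of every <span> whose style attribute equals target_style."""

    def __init__(self, target_style):
        super().__init__()
        self.target_style = target_style
        self.depth = 0      # > 0 while the parser is inside a matching <span>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1                      # nested span inside a match
        elif dict(attrs).get("style") == self.target_style:
            self.depth = 1
            self.texts.append("")                # start a new text entry

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data               # text of child tags is included

STYLE = "color: rgb(0, 0, 0); font-family: Arial, sans-serif;"
sample_html = (
    '<p><span style="%s">Audio synthesis has a variety of applications.</span>'
    '<span style="color: red;">menu</span>'
    '<span style="%s">Our paper will be presented at <b>ICML</b> 2020.</span></p>'
) % (STYLE, STYLE)

extractor = SpanTextExtractor(STYLE)
extractor.feed(sample_html)
for text in extractor.texts:
    print(text)
```

Matching on an exact style string is brittle: if the page changes its inline styles the filter returns nothing, so it is worth printing the number of matches when adapting this to another site.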
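Reference links:
Beautiful Soup 4.4.0 documentation
Parsing HTML with Python Beautiful Soup to get data
Using find and find_all in BeautifulSoup
Removing specified HTML tags and comments with BeautifulSoup
AI Studio project: crawling contestant information from "Youth With You 2"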
import requests
from bs4 import BeautifulSoup

def print_crawl_data(url, save_path):
    """
    Crawl the HTML page at the given url, print its text, and append it to save_path
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        # print(response.status_code)
        # Passing a document (here a string) to the BeautifulSoup constructor returns a document object;
        # the lxml parser was installed in step 1
        soup = BeautifulSoup(response.text, 'lxml')
        # [s.extract() for s in soup('a')]
        # Search by attribute: returns every <span> tag whose style is
        # 'color: rgb(0, 0, 0); font-family: Arial, sans-serif;'
        texts = soup.find_all('span', {'style': 'color: rgb(0, 0, 0); font-family: Arial, sans-serif;'})
        for text in texts:
            # print(text.text)
            with open(save_path, 'a', encoding='utf-8') as f:
                result = text.text
                print(result)
                f.write(result + "\n")
    except Exception as e:
        print(e)

print_crawl_data('http://research.baidu.com/Blog/index-view?id=139', 'article.txt')
Audio synthesis has a variety of applications, including text-to-speech (TTS), music generation, virtual assistant, and digital content creation. In recent years, deep neural network has obtained noticeable successes for synthesizing raw audio in high-fidelity speech and music generation. One of the most successful examples are autoregressive models (e.g., WaveNet). However, they sequentially generate high temporal resolution of raw waveform (e.g., 24 kHz) at synthesis, which are prohibitively slow for real-time applications.
Many researchers from various organizations have spent considerable effort to develop parallel generative models for raw audio. Parallel WaveNet and ClariNet could generate high-fidelity audio in parallel, but they require distillation from a pretrained autoregressive model and a set of auxiliary losses for training, which complicates the training pipeline and increases the cost of development. GAN-based model can be trained from scratch, but it provides inferior audio fidelity than WaveNet. WaveGlow can be trained directly with maximum likelihood, but the model has huge number of parameters (e.g., 88M parameters) to reach the comparable fidelity of audio as WaveNet.
Today, we’re excited to announce WaveFlow (paper, audio samples), the latest milestone of audio synthesis research at Baidu. It features: 1) high-fidelity & ultra-fast audio synthesis, 2) simple likelihood-based training, and 3) small memory footprint, which could not be achieved simultaneously in previous work. Our small-footprint model (5.91M parameters) can synthesize high-fidelity speech (MOS: 4.32) more than 40x faster than real-time on a Nvidia V100 GPU. WaveFlow also provides a unified view of likelihood-models for raw audio, which includes both WaveNet and WaveGlow as special cases and allow us to explicitly trade inference parallelism for model capacity.
Our paper will be presented at ICML 2020.
For more details of WaveFlow, please check out our paper: https://arxiv.org/abs/1912.01219
Audio samples are in: https://waveflow-demo.github.io/
The implementation can be accessed in Parakeet, which is a text-to-speech toolkit building on PaddlePaddle: https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow
# Remove blank lines from the crawled text
with open('article.txt', 'r', encoding='utf-8') as fr, open('article2.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines():
        if text.split():
            fd.write(text)
print('Blank lines removed...')
Blank lines removed...
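# Start a new line after every period so that each line holds roughly one sentence.
# Note that this also breaks at the periods inside "e.g." and decimals such as "5.91M".
with open('article2.txt', 'r', encoding='utf-8') as fr, open('article3.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines():
        text = text.replace('.', '.\n')
        fd.write(text)
print('Line breaks inserted...')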
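Note: the pretrained models in the Parakeet model library were all trained on short sentences, so to get good synthesis quality the txt file still needs some manual tidying. The final result can be seen in article3.txt.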
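Part of that manual tidying can be automated. The sketch below splits only at sentence-ending punctuation that is followed by whitespace and a capital letter, so decimals such as 5.91M and URLs stay intact. It is a heuristic under stated assumptions, not what the original project used: the function name `split_sentences` and the abbreviation list are made up for illustration and will need extending for other texts.

```python
import re

# Abbreviations that end with a period but do not end a sentence
# (a deliberately short, hypothetical list).
ABBREVIATIONS = ("e.g.", "i.e.", "et al.")

def split_sentences(text):
    """Split text into sentences without breaking decimals, URLs, or listed abbreviations."""
    # Split at whitespace that follows . ! or ? and precedes an uppercase letter
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].endswith(ABBREVIATIONS):
            sentences[-1] += " " + piece      # undo a split right after an abbreviation
        else:
            sentences.append(piece)
    return [s for s in sentences if s]

sample = ("WaveGlow has huge number of parameters (e.g., 88M parameters). "
          "Our model (5.91M parameters) reaches MOS: 4.32. "
          "See https://arxiv.org/abs/1912.01219 for details.")
for sentence in split_sentences(sample):
    print(sentence)
```

Writing one returned sentence per line gives a file in the shape the synthesis step expects. Very long sentences would still have to be shortened by hand.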
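Parsing the PDF article
Here pdfminer parses the PDF. This only works for ordinary PDFs; a PDF that cannot be parsed has to be converted to images and run through OCR. Also note that under Python 3 the package to install is pdfminer3k.
The example parses the PDF of the paper "WaveFlow: A Compact Flow-based Model for Raw Audio" (renamed to waveflow.pdf after download) and extracts the abstract, ready for the later text-to-speech (TTS) step.
Reference links:
Parsing PDF with pdfminer in Python
Removing blank lines from a text file in Python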
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams

def parse(DataIO, save_path):
    # Create a PDF parser from the file object
    parser = PDFParser(DataIO)
    # Create a PDF document
    doc = PDFDocument()
    # Connect the parser and the document to each other
    parser.set_document(doc)
    doc.set_parser(parser)
    # Supply the initial password; empty by default
    doc.initialize()
    # Check whether the document allows text extraction; raise if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to manage shared resources
        rsrcmagr = PDFResourceManager()
        # Create the layout-analysis parameters
        laparams = LAParams()
        # Create an aggregator device from the resource manager and the parameters
        device = PDFPageAggregator(rsrcmagr, laparams=laparams)
        # Create a PDF interpreter
        interpreter = PDFPageInterpreter(rsrcmagr, device)
        # doc.get_pages() yields the pages; process one page at a time
        for page in doc.get_pages():
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            # layout holds the objects parsed from the page, typically LTTextBox,
            # LTFigure, LTImage, LTTextBoxHorizontal and so on; text is read with get_text()
            for x in layout:
                try:
                    if isinstance(x, LTTextBoxHorizontal):
                        with open(save_path, 'a', encoding='utf-8') as f:
                            result = x.get_text()
                            print(result)
                            f.write(result + "\n")
                except Exception:
                    print("Failed")

# Parse the local PDF and save the text to a local TXT file
with open('waveflow.pdf', 'rb') as pdf_html:
    parse(pdf_html, 'pdf2text_output.txt')
# Lines 60 to 85 of the extracted text hold the abstract; drop blank lines and save it
with open('pdf2text_output.txt', 'r', encoding='utf-8') as fr, open('abstract.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines()[60:86]:
        if text.split():
            fd.write(text)
            print(text)
print('Abstract printed')
Abstract
In this work, we propose WaveFlow, a small-
footprint generative flow for raw audio, which
is directly trained with maximum likelihood. It
handles the long-range structure of 1-D wave-
form with a dilated 2-D convolutional architec-
ture, while modeling the local variations using
expressive autoregressive functions. WaveFlow
provides a unified view of likelihood-based mod-
els for 1-D data, including WaveNet and Wave-
Glow as special cases. It generates high-fidelity
speech as WaveNet, while synthesizing several
orders of magnitude faster as it only requires a
few sequential steps to generate very long wave-
forms with hundreds of thousands of time-steps.
Furthermore, it can significantly reduce the likeli-
hood gap that has existed between autoregressive
models and flow-based models for efficient syn-
thesis. Finally, our small-footprint WaveFlow has
only 5.91M parameters, which is 15× smaller
than WaveGlow. It can generate 22.05 kHz high-
fidelity audio 42.6× faster than real-time (at a rate
of 939.3 kHz) on a V100 GPU without engineered
inference kernels.
Abstract printed
Note: to get good synthesis quality, the end-of-line hyphens in the paper text have to be fixed by hand. The final result can be seen in abstract.txt.
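The hyphen repair can be partly scripted. The sketch below joins PDF lines into one string and removes a trailing hyphen when the word was only split for layout (`mod-` + `els`), while keeping it for compounds that are really hyphenated (`small-` + `footprint`). Telling the two cases apart needs a word list, so `KEEP_HYPHEN` here is a small hypothetical set chosen for this abstract, and `join_pdf_lines` is an illustrative name rather than part of the original project.

```python
import re

# Compounds that keep their hyphen even when the PDF breaks the line there
# (hypothetical list for this abstract; adjust it for other papers).
KEEP_HYPHEN = {"small-footprint", "high-fidelity"}

def join_pdf_lines(lines):
    """Merge PDF text lines into one string, repairing end-of-line hyphens."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue                              # skip blank lines
        if text.endswith("-"):
            head = text.rsplit(" ", 1)[-1]        # fragment before the break, e.g. "mod-"
            tail = re.match(r"[A-Za-z]+", line)   # first word after the break
            compound = head + (tail.group(0) if tail else "")
            if compound.lower() in KEEP_HYPHEN:
                text += line                      # real compound: keep the hyphen
            else:
                text = text[:-1] + line           # layout break: drop the hyphen
        elif text:
            text += " " + line
        else:
            text = line
    return text

pdf_lines = [
    "In this work, we propose WaveFlow, a small-",
    "footprint generative flow for raw audio, which",
    "provides a unified view of likelihood-based mod-",
    "els for 1-D data, including WaveNet and Wave-",
    "Glow as special cases.",
]
print(join_pdf_lines(pdf_lines))
```

The joined string can then be split into one sentence per line before it is passed to the TTS step.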
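Parsing the text in images: in PaddleOCR/tools/infer/predict_system.py, make a small change to the following part of main() so that it outputs only the recognized text, which is easier to read: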
drop_score = 0.5
dt_num = len(dt_boxes)
for dno in range(dt_num):
    text, score = rec_res[dno]
    if score >= drop_score:
        # Print only the text and append it to a txt file
        # text_str = "%s, %.3f" % (text, score)
        with open('../ocr_text.txt', 'a') as f:
            text_str = "%s" % (text)
            f.write(text_str + "\n")
        print(text_str)
%cd /home/aistudio/PaddleOCR
/home/aistudio/PaddleOCR
# Download some images of English quotes
!wget https://quotefancy.com/media/wallpaper/3840x2160/50594-Francis-Bacon-Quote-Knowledge-is-power.jpg --no-check-certificate
!wget https://www.quotemaster.org/images/24/2423b4151b7283c4570e2967fbf022cf.jpg
!wget https://www.promptaconsultinggroup.com/wp-content/uploads/2018/10/Focus-on-Results.jpg
!wget https://quotefancy.com/media/wallpaper/1600x900/50583-Francis-Bacon-Quote-Knowledge-is-power.jpg --no-check-certificate
!wget https://quotefancy.com/media/wallpaper/3840x2160/2347129-William-Shakespeare-Quote-To-be-or-not-to-be-that-is-the-question.jpg --no-check-certificate
--2020-08-02 19:40:58--  https://www.promptaconsultinggroup.com/wp-content/uploads/2018/10/Focus-on-Results.jpg
Resolving www.promptaconsultinggroup.com (www.promptaconsultinggroup.com)... 67.43.226.3
Connecting to www.promptaconsultinggroup.com (www.promptaconsultinggroup.com)|67.43.226.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 883254 (863K) [image/jpeg]
Saving to: ‘Focus-on-Results.jpg’
Focus-on-Results.jp 100%[===================>] 862.55K  11.6KB/s    in 72s
2020-08-02 19:42:14 (12.0 KB/s) - ‘Focus-on-Results.jpg’ saved [883254/883254]
!python tools/infer/predict_system.py \
    --image_dir="50594-Francis-Bacon-Quote-Knowledge-is-power.jpg" \
    --det_model_dir="./inference/ch_det_r50_vd_db/" \
    --rec_model_dir="./inference/ch_rec_r34_vd_crnn_enhance/" \
    --use_space_char=True
dt_boxes num : 6, elapse : 0.02082991600036621
rec_res num : 6, elapse : 0.019023895263671875
Predict time of 50594-Francis-Bacon-Quote-Knowledge-is-power.jpg: 0.097s
Knowledge
is power
Francis
Bacon
quotefancy
The visualized image saved in ./inference_results/50594-Francis-Bacon-Quote-Knowledge-is-power.jpg
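OCR text recognition result:
Step 3: Text to speech
This step modifies the example script Parakeet/examples/fastspeech/synthesis.py. The key change is that instead of testing synthesis on a single given sentence, the script reads a txt file line by line and generates speech for each line. The modified synthesis() function is shown below; see synthesis.py for the complete set of changes.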
# Relies on the imports and helper functions already present in the example's synthesis.py
def synthesis(args):
    local_rank = dg.parallel.Env().local_rank
    place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
    fluid.enable_dygraph(place)

    with open(args.config) as f:
        cfg = yaml.load(f, Loader=yaml.Loader)

    if not os.path.exists(args.output):
        os.mkdir(args.output)

    writer = SummaryWriter(os.path.join(args.output, 'log'))

    model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
    # Load parameters.
    global_step = io.load_parameters(
        model=model, checkpoint_path=args.checkpoint)
    model.eval()

    # Read the txt file line by line and generate speech for each line
    for i, line in enumerate(open(args.text_input)):
        text_input = line
        text = np.asarray(text_to_sequence(text_input))
        text = np.expand_dims(text, axis=0)
        pos_text = np.arange(1, text.shape[1] + 1)
        pos_text = np.expand_dims(pos_text, axis=0)

        text = dg.to_variable(text).astype(np.int64)
        pos_text = dg.to_variable(pos_text).astype(np.int64)

        _, mel_output_postnet = model(text, pos_text, alpha=args.alpha)

        if args.vocoder == 'griffin-lim':
            # synthesis use griffin-lim
            wav = synthesis_with_griffinlim(mel_output_postnet, cfg['audio'])
        elif args.vocoder == 'waveflow':
            wav = synthesis_with_waveflow(mel_output_postnet, args,
                                          args.checkpoint_vocoder, place)
        else:
            print('vocoder error, we only support griffinlim and waveflow, but received %s.' % args.vocoder)

        writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0,
                         cfg['audio']['sr'])
        if not os.path.exists(os.path.join(args.output, 'samples')):
            os.mkdir(os.path.join(args.output, 'samples'))
        # One wav per input line, named <vocoder><line index>.wav
        write(os.path.join(os.path.join(args.output, 'samples'), args.vocoder + str(i) + '.wav'),
              cfg['audio']['sr'], wav)
    print("Synthesis completed !!!")
    writer.close()
%env CUDA_VISIBLE_DEVICES=0
env: CUDA_VISIBLE_DEVICES=0
%cd /home/aistudio/Parakeet/examples/fastspeech
/home/aistudio/Parakeet/examples/fastspeech
Reading the HTML article aloud with WaveFlow as the vocoder
!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/article3.txt'
{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/step-2000000',
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml',
 'output': './synthesis',
 'text_input': '/home/aistudio/article3.txt',
 'use_gpu': 1,
 'vocoder': 'waveflow'}
The generated TTS audio is saved in the Parakeet/examples/fastspeech/synthesis/samples folder. Pick a few clips to check the result:
import IPython
IPython.display.Audio('synthesis/samples/waveflow3.wav')
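Merging the generated audio files with ffmpeg
Because the audio files were generated by scanning the text line by line, hearing the complete article means concatenating them in order.
Before concatenating with ffmpeg, prepare a list.txt file in the following format:
file 'path/to/file1'
file 'path/to/file2'
file 'path/to/file3'
Then run ffmpeg -f concat -i list.txt -c copy "outputfile" to do the concatenation.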
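The list file can be generated by a small helper. The sketch below writes one `file '<path>'` line per wav in the format shown above. The helper names `numbered_wavs` and `write_concat_list` are made up for illustration. The file is opened in write mode rather than append mode so that rerunning the cell does not add duplicate entries, and the demo writes to a temporary directory so that it can run anywhere.

```python
import os
import tempfile

def numbered_wavs(prefix, count):
    """Return prefix0.wav ... prefix<count-1>.wav, matching the per-line output names."""
    return ["%s%d.wav" % (prefix, i) for i in range(count)]

def write_concat_list(wav_paths, list_path):
    """Write an ffmpeg concat list with one file '<path>' line per wav."""
    with open(list_path, "w", encoding="utf-8") as f:   # "w": reruns overwrite, not append
        for path in wav_paths:
            if "'" in path:
                # Quoted paths containing a single quote would need extra escaping
                raise ValueError("unsupported character in path: %r" % path)
            f.write("file '%s'\n" % path)
    return list_path

# Demo: three clips, list written to a temporary directory
demo_dir = tempfile.mkdtemp()
demo_list = write_concat_list(
    numbered_wavs("synthesis/samples/waveflow", 3),
    os.path.join(demo_dir, "list.txt"),
)
with open(demo_list, encoding="utf-8") as f:
    print(f.read())
```

In the notebook, `count` would be the number of lines in the input txt file, since the synthesis step writes one wav per line.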
# Generate the list file: one entry per line of the input text
for i, line in enumerate(open('/home/aistudio/article3.txt')):
    with open('waveflow_article3.txt', 'a') as f:
        result = 'file synthesis/samples/waveflow' + str(i) + '.wav'
        f.write(result + "\n")
# Concatenate the audio
!ffmpeg -f concat -i waveflow_article3.txt -c copy 'waveflow_article3.wav'
ffmpeg version 2.8.15-0ubuntu0.16.04.1 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 20160609
  configuration: --prefix=/usr --extra-version=0ubuntu0.16.04.1 --build-suffix=-ffmpeg --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --cc=cc --cxx=g++ --enable-gpl --enable-shared --disable-stripping --disable-decoder=libopenjpeg --disable-decoder=libschroedinger --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxvid --enable-libzvbi --enable-openal --enable-opengl --enable-x11grab --enable-libdc1394 --enable-libiec61883 --enable-libzmq --enable-frei0r --enable-libx264 --enable-libopencv
  libavutil      54. 31.100 / 54. 31.100
  libavcodec     56. 60.100 / 56. 60.100
  libavformat    56. 40.101 / 56. 40.101
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 40.101 /  5. 40.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.101 /  1.  2.101
  libpostproc    53.  3.100 / 53.  3.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, concat, from 'waveflow_article3.txt':
  Duration: N/A, start: 0.000000, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, 1 channels, flt, 705 kb/s
Output #0, wav, to 'waveflow_article3.wav':
  Metadata:
    ISFT            : Lavf56.40.101
    Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, mono, 705 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
size=   16235kB time=00:03:08.49 bitrate= 705.6kbits/s
video:0kB audio:16235kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000686%
Reading the HTML article aloud with the Griffin-Lim algorithm as the vocoder
!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --text_input='/home/aistudio/article3.txt'
{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': None,
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': None,
 'output': './synthesis',
 'text_input': '/home/aistudio/article3.txt',
 'use_gpu': 1,
 'vocoder': 'griffin-lim'}
import IPython
IPython.display.Audio('synthesis/samples/griffin-lim3.wav')
Merging the generated audio files with ffmpeg
# Generate the list file: one entry per line of the input text
for i, line in enumerate(open('/home/aistudio/article3.txt')):
    with open('griffin-lim_article3.txt', 'a') as f:
        result = 'file synthesis/samples/griffin-lim' + str(i) + '.wav'
        f.write(result + "\n")
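# Concatenate the audio
!ffmpeg -f concat -i griffin-lim_article3.txt -c copy 'griffin-lim_article3.wav'
Text-to-speech results for the paper abstract and the OCR text
The TTS procedure for abstract.txt and ocr_text.txt is exactly the same as for article3.txt above. The only difference is that the audio synthesized from the OCR text is small, so it can be played directly in the Notebook.
1. Paper abstract TTS: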
!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/abstract.txt'
# Generate the list file
for i, line in enumerate(open('/home/aistudio/abstract.txt')):
    with open('waveflow_abstract.txt', 'a') as f:
        result = 'file synthesis/samples/waveflow' + str(i) + '.wav'
        f.write(result + "\n")
# Concatenate the audio
!ffmpeg -f concat -i waveflow_abstract.txt -c copy 'waveflow_abstract.wav'
2. OCR recognition TTS (Knowledge is Power)
Note: ocr_text.txt contains very little text, so it was merged into a single line by hand.
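For larger OCR outputs the merge can be scripted. The sketch below joins the recognized lines into one sentence and optionally drops unwanted lines such as a watermark. The source does not say what the manual edit kept, so dropping "quotefancy" is an assumption made for illustration, as is the function name `merge_ocr_lines`.

```python
def merge_ocr_lines(lines, drop=()):
    """Join OCR text lines into one sentence, skipping blank and unwanted lines."""
    skip = {d.lower() for d in drop}
    kept = []
    for line in lines:
        line = line.strip()
        if line and line.lower() not in skip:
            kept.append(line)
    return " ".join(kept)

# Lines as they would be read from ocr_text.txt, one recognized text box per line
ocr_lines = ["Knowledge\n", "is power\n", "Francis\n", "Bacon\n", "quotefancy\n"]
print(merge_ocr_lines(ocr_lines, drop=("quotefancy",)))
```

The synthesis script produces one wav per input line, so writing the merged sentence back as a single line yields a single audio file.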
!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/ocr_text.txt'
{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/step-2000000',
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml',
 'output': './synthesis',
 'text_input': '/home/aistudio/ocr_text.txt',
 'use_gpu': 1,
 'vocoder': 'waveflow'}
[checkpoint] Rank 0: loaded model from ./fastspeech_ljspeech_ckpt_1.0/step-162000.pdparams
[checkpoint] Rank 0: loaded model from ./waveflow_res128_ljspeech_ckpt_1.0/step-2000000.pdparams
Synthesis completed !!!
!mv synthesis/samples/waveflow0.wav ./ocr.wav
import IPython
IPython.display.Audio('ocr.wav')
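Summary:
How can the TTS quality be improved further?
1. Find a better way to lay out the text automatically. This project used Python to process the text parsed from HTML and PDF, but the final layout adjustment was still done by hand, and only then was the TTS quality good. Automatic layout needs further work with regular expressions and other NLP techniques. (This is presumably a hard problem across the industry; even the latest Edge browser has layout problems.)
2. The Parakeet pretrained models were trained only on the LJSpeech dataset. Continuing training on more speech datasets could give richer speaking styles and more accurate pronunciation. For training with Parakeet, see Parakeet: A Hands-On Guide to Training Speech Synthesis Models (script task, Notebook).
3. The pretrained models shipped with PaddleOCR leave room for improvement on English text. One option is to train PaddleOCR on more English OCR datasets. (To be updated later.)
The complete project, including the code and the text files, is public on AI Studio. Forks are welcome.
https://aistudio.baidu.com/aistudio/projectdetail/676162
To learn more about PaddlePaddle, see the following.
Gitee:
https://gitee.com/paddlepaddle/PaddleOCR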