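Deep learning papers are always a bit of a slog to read. What do you do when you just can't get through one?

Let PaddlePaddle read it to me ︿(￣︶￣)︿

Project overview

How can PaddlePaddle "read" a paper by itself, that is, carry out a text-to-speech task? Broken down simply, it comes to implementing text-to-speech (TTS) for the following three scenarios:

The introduction to a paper on an HTML page
The abstract of a paper in a PDF
English sentences in an image, via OCR

These three scenarios are implemented with two PaddlePaddle development kits: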

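1. The PaddlePaddle Parakeet kit handles text-to-speech, with two vocoders, WaveFlow and Griffin-Lim, each used to synthesize the speech waveform. WaveFlow is a vocoder based on a deep neural network. Griffin-Lim is an algorithm that reconstructs speech when only the magnitude spectrum is known and the phase spectrum is not; it is a classical vocoder, simple and efficient. Readers can compare the two in the final TTS audio.

Parakeet (project: https://github.com/PaddlePaddle/Parakeet)

PaddlePaddle's speech synthesis kit provides flexible, efficient, state-of-the-art text-to-speech tools that help developers build and apply speech synthesis models more conveniently.

Prerequisite project: Parakeet: a step-by-step guide to training a speech synthesis model (script task, Notebook)

2. The PaddlePaddle PaddleOCR kit converts text in images into readable text. Papers contain figures, and the text in a figure must first be converted to plain text before it can be "read" aloud; an OCR model does this. Text-to-speech pronounces each word, so the OCR model has to recognize not only characters but also words. This project therefore uses the PaddleOCR pretrained model that can recognize spaces to convert image text into readable text.

PaddleOCR (project: https://github.com/PaddlePaddle/PaddleOCR)

PaddlePaddle's text recognition kit aims to be a rich, leading, and practical library of text detection and recognition models and tools. It has open-sourced an ultra-lightweight Chinese OCR model and a general-purpose Chinese OCR model, and provides training methods for dozens of text detection and recognition models, helping users train better models and put them into production.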

Final TTS results

Reading of the HTML article paragraphs:

----------------------------

Audio synthesis has a variety of applications, including text-to-speech (TTS),

music generation, virtual assistant, and digital content creation.

In recent years, deep neural network has obtained noticeable successes for

synthesizing raw audio in high-fidelity speech and music generation.

One of the most successful examples are autoregressive models (e.g., WaveNet).

However, they sequentially generate high temporal resolution of raw waveform (e.g., 24 kHz) at synthesis,

which are prohibitively slow for real-time applications.

Many researchers from various organizations have spent considerable effort to develop parallel generative models for raw audio.

Parallel WaveNet and ClariNet could generate high-fidelity audio in parallel,

but they require distillation from a pretrained autoregressive model and a set of auxiliary losses for training,

which complicates the training pipeline and increases the cost of development.

GAN-based model can be trained from scratch, but it provides inferior audio fidelity than WaveNet.

WaveGlow can be trained directly with maximum likelihood,

but the model has huge number of parameters (e.g., 88M parameters) to reach the comparable fidelity of audio as WaveNet.

Today, we’re excited to announce WaveFlow (paper, audio samples), the latest milestone of audio synthesis research at Baidu.

It features: 1) high-fidelity & ultra-fast audio synthesis, 2) simple likelihood-based training,

and 3) small memory footprint, which could not be achieved simultaneously in previous work.

Our small-footprint model (5.91M parameters) can synthesize high-fidelity speech (MOS: 4.32)

more than 40x faster than real-time on a Nvidia V100 GPU.

WaveFlow also provides a unified view of likelihood-models for raw audio,

which includes both WaveNet and WaveGlow as special cases and allow us to explicitly trade inference parallelism for model capacity.

Our paper will be presented at ICML 2020.

For more details of WaveFlow, please check out our paper: https://arxiv.org/abs/1912.01219

Audio samples are in: https://waveflow-demo.github.io/

The implementation can be accessed in Parakeet, which is a text-to-speech toolkit building on PaddlePaddle:

https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow

----------------------------

Reading of the PDF abstract. The passage read:

----------------------------

Abstract

In this work, we propose WaveFlow, a small-footprint generative ﬂow for raw audio, which

is directly trained with maximum likelihood.

It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture,

while modeling the local variations using expressive autoregressive functions.

WaveFlow provides a uniﬁed view of likelihood-based models for 1-D data,

including WaveNet and WaveGlow as special cases.

It generates high-ﬁdelity speech as WaveNet,

while synthesizing several orders of magnitude faster as

it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps.

Furthermore, it can signiﬁcantly reduce the likelihood gap that has existed

between autoregressive models and ﬂow-based models for efﬁcient synthesis.

Finally, our small-footprint WaveFlow has only 5.91M parameters,

which is 15× smaller than WaveGlow.

It can generate 22.05 kHz high-ﬁdelity audio 42.6× faster than real-time

(at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.

----------------------------

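Reading of the OCR image text:

Detailed project walkthrough

The steps below are available on AI Studio, where they can be tried online; readers can of course also try running them on their own machines:

https://aistudio.baidu.com/aistudio/projectdetail/676162

Step 1: Download and install the libraries

Install the Parakeet library

Note: if importing Parakeet raises an error after installation, restart the project and the import will work.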

!git clone https://github.com/PaddlePaddle/Parakeet
%cd Parakeet
!pip install -e .
# Return to the working directory so the relative paths used below resolve
%cd ..

import nltk
nltk.download("punkt")
nltk.download("cmudict")

Prepare the Parakeet pretrained models

The pretrained models needed are:

The WaveFlow pretrained model with 128 residual channels (res128)
The FastSpeech text-to-speech pretrained model

!wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
!unzip waveflow_res128_ljspeech_ckpt_1.0.zip -d Parakeet/examples/fastspeech/
!wget https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech_ljspeech_ckpt_1.0.zip
!unzip fastspeech_ljspeech_ckpt_1.0.zip -d Parakeet/examples/fastspeech/fastspeech_ljspeech_ckpt_1.0/

Install PaddleOCR

!git clone https://gitee.com/paddlepaddle/PaddleOCR.git
%cd PaddleOCR/
!pip install -r requirments.txt

Prepare the recognition pretrained model that supports spaces

!mkdir inference
%cd inference
!wget https://paddleocr.bj.bcebos.com/ch_models/ch_rec_r34_vd_crnn_enhance_infer.tar && tar xf ch_rec_r34_vd_crnn_enhance_infer.tar
!wget https://paddleocr.bj.bcebos.com/ch_models/ch_det_r50_vd_db_infer.tar && tar xf ch_det_r50_vd_db_infer.tar

%cd ../..

Install Beautiful Soup and the other libraries

!pip install bs4
!pip install xlwt
!pip install xlrd
!pip install lxml
!pip install w3lib
!pip install pdfminer3k

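Step 2: Parse the article content

The content of three typical sources, an HTML web article, an ordinary PDF, and text in an image, is parsed as follows.

Parsing the HTML article:

The requests module and the Beautiful Soup library are used to crawl and clean the Baidu Research page that introduces WaveFlow, WaveFlow: A Compact Flow-Based Model for Raw Audio.

Beautiful Soup is a Python library for extracting data from HTML or XML files. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

It is a toolbox that parses a document and hands the user the data to be scraped; because it is simple, a complete application takes very little code.

Reference links:

Beautiful Soup 4.4.0 documentation
Parsing HTML with Python Beautiful Soup to get data
Using find and find_all in BeautifulSoup
Removing specified HTML tags and comments with BeautifulSoup
AI Studio project: crawling contestant information for "Youth With You 2"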
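To make the extraction step concrete without any third-party dependency, here is a minimal sketch that swaps Beautiful Soup for the standard-library html.parser module. It collects the text of every <span> whose style attribute matches a given string, which is the same selection the crawler in this section performs with find_all. The class name SpanTextExtractor, the helper extract_span_texts, and the sample HTML are hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser


class SpanTextExtractor(HTMLParser):
    """Collect the text of every <span> whose style attribute matches exactly."""

    def __init__(self, style):
        super().__init__()
        self.style = style
        self.depth = 0    # > 0 while inside a matching <span>
        self.parts = []   # text fragments of the span currently being read
        self.texts = []   # finished span texts

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth > 0:
            self.depth += 1  # nested <span> inside a matching one
        elif dict(attrs).get("style") == self.style:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_span_texts(html, style):
    """Return the text of all <span> elements with the given style string."""
    parser = SpanTextExtractor(style)
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical page fragment: one span to skip, one span to keep.
SAMPLE_HTML = (
    '<p><span style="color: red;">skip me</span>'
    '<span style="color: rgb(0, 0, 0); font-family: Arial, sans-serif;">'
    "Audio synthesis has <b>many</b> applications.</span></p>"
)
STYLE = "color: rgb(0, 0, 0); font-family: Arial, sans-serif;"
print(extract_span_texts(SAMPLE_HTML, STYLE))
```

Matching on an exact style string is brittle: if the page changes its inline styles, nothing is extracted. For a real page, Beautiful Soup with a more stable selector is usually the more comfortable choice.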

import json
import requests
import datetime
from bs4 import BeautifulSoup

def print_crawl_data(url, save_path):
    """
    Crawl the HTML page at the given url, print its content and append it to save_path.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        # print(response.status_code)
        # Passing a document (here a string) to the BeautifulSoup constructor returns a document object.
        # The lxml parser installed above is named explicitly.
        soup = BeautifulSoup(response.text, 'lxml')
        # [s.extract() for s in soup('a')]
        # Search by attribute: returns every <span> tag whose style is
        # 'color: rgb(0, 0, 0); font-family: Arial, sans-serif;'
        texts = soup.find_all('span', {'style': 'color: rgb(0, 0, 0); font-family: Arial, sans-serif;'})
        for text in texts:
            # Take only the text content of the current node
            # print(text.text)
            with open('%s' % (save_path), 'a', encoding='utf-8') as f:
                result = text.text
                print(result)
                f.write(result + "\n")
    except Exception as e:
        print(e)

print_crawl_data('http://research.baidu.com/Blog/index-view?id=139', 'article.txt')

Audio synthesis has a variety of applications, including text-to-speech (TTS), music generation, virtual assistant, and digital content creation. In recent years, deep neural network has obtained noticeable successes for synthesizing raw audio in high-fidelity speech and music generation. One of the most successful examples are autoregressive models (e.g., WaveNet). However, they sequentially generate high temporal resolution of raw waveform (e.g., 24 kHz) at synthesis, which are prohibitively slow for real-time applications.

Many researchers from various organizations have spent considerable effort to develop parallel generative models for raw audio. Parallel WaveNet and ClariNet could generate high-fidelity audio in parallel, but they require distillation from a pretrained autoregressive model and a set of auxiliary losses for training, which complicates the training pipeline and increases the cost of development. GAN-based model can be trained from scratch, but it provides inferior audio fidelity than WaveNet. WaveGlow can be trained directly with maximum likelihood, but the model has huge number of parameters (e.g., 88M parameters) to reach the comparable fidelity of audio as WaveNet.

Today, we’re excited to announce WaveFlow (paper, audio samples), the latest milestone of audio synthesis research at Baidu. It features: 1) high-fidelity & ultra-fast audio synthesis, 2) simple likelihood-based training, and 3) small memory footprint, which could not be achieved simultaneously in previous work. Our small-footprint model (5.91M parameters) can synthesize high-fidelity speech (MOS: 4.32) more than 40x faster than real-time on a Nvidia V100 GPU. WaveFlow also provides a unified view of likelihood-models for raw audio, which includes both WaveNet and WaveGlow as special cases and allow us to explicitly trade inference parallelism for model capacity.

Our paper will be presented at ICML 2020.

For more details of WaveFlow, please check out our paper: https://arxiv.org/abs/1912.01219

Audio samples are in: https://waveflow-demo.github.io/

The implementation can be accessed in Parakeet, which is a text-to-speech toolkit building on PaddlePaddle: https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow

# Drop blank lines
with open('article.txt', 'r', encoding='utf-8') as fr, open('article2.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines():
        if text.split():
            fd.write(text)
    print('Blank lines removed...')

Blank lines removed...

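# Put each sentence on its own line by breaking after every period
with open('article2.txt', 'r', encoding='utf-8') as fr, open('article3.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines():
        text = text.replace('.', '.\n')
        fd.write(text)
    print('Line breaks processed...')

Note: the pretrained models in Parakeet were all trained on short sentences, so to get good synthesis quality the txt file still needs some manual tidying. The final result can be seen in article3.txt.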
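Breaking after every period also cuts inside abbreviations such as "e.g.," and decimals such as "5.91M", which is part of why the file needs manual cleanup. A slightly more careful split can be sketched with a regular expression; this is only a heuristic under the assumption that a sentence ends with '.', '!' or '?' followed by whitespace and an uppercase letter, and the name split_sentences is hypothetical.

```python
import re

# Split after '.', '!' or '?' only when followed by whitespace and an
# uppercase letter, so "e.g., WaveNet", "5.91M" and URLs stay intact.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph):
    """Return the sentences of a paragraph, one string per sentence."""
    return [s.strip() for s in _SENTENCE_END.split(paragraph) if s.strip()]


sample = (
    "One of the most successful examples are autoregressive models "
    "(e.g., WaveNet). However, they are slow. "
    "Our model (5.91M parameters) is fast."
)
for sentence in split_sentences(sample):
    print(sentence)
```

The heuristic still splits after an abbreviation that is directly followed by a capitalized word (for example "e.g. WaveNet" without the comma), so a final manual pass remains advisable.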

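Parsing the PDF article

pdfminer is used here to parse the PDF (note: this works for ordinary PDFs; a PDF that cannot be parsed has to be converted to images for OCR). Note also that on Python 3 the library to install is pdfminer3k.

The example parses the PDF of the paper WaveFlow: A Compact Flow-based Model for Raw Audio (renamed to waveflow.pdf after download) and extracts the abstract, ready for the text-to-speech (TTS) step that follows.

Reference links:

Parsing PDFs with pdfminer in Python
Removing blank lines from a text file in Python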

import urllib
import importlib, sys
importlib.reload(sys)

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfdevice import PDFDevice
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

def parse(DataIO, save_path):
    # Create a PDF parser from the file object
    parser = PDFParser(DataIO)
    # Create a PDF document
    doc = PDFDocument()
    # Connect the parser and the document to each other
    parser.set_document(doc)
    doc.set_parser(parser)
    # Supply the initial password; empty by default
    doc.initialize()
    # Check whether the document allows text extraction; stop if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to manage shared resources
        rsrcmagr = PDFResourceManager()
        # Create the layout parameters for the device object
        laparams = LAParams()
        # Combine the resource manager and the device object
        device = PDFPageAggregator(rsrcmagr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmagr, device)
        # Loop over the pages returned by doc.get_pages(), one page at a time
        for page in doc.get_pages():
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            # layout is an LTPage object holding the objects parsed from the page,
            # typically LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal and so on.
            # To get the text, read the text of each text box.
            for x in layout:
                try:
                    if (isinstance(x, LTTextBoxHorizontal)):
                        with open('%s' % (save_path), 'a') as f:
                            result = x.get_text()
                            print(result)
                            f.write(result + "\n")
                except:
                    print("Failed")

# Parse the local PDF and save the text to a local TXT file
with open('waveflow.pdf', 'rb') as pdf_html:
    parse(pdf_html, 'pdf2text_output.txt')

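# The slice bounds were lost from the original listing. Set start and end to the
# line range of the abstract in pdf2text_output.txt (None keeps every line).
start, end = None, None

with open('pdf2text_output.txt', 'r', encoding='utf-8') as fr, open('abstract.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines()[start:end]:
        if text.split():
            fd.write(text)
            print(text)
    print('Abstract printed')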

Abstract
In this work, we propose WaveFlow, a small-
footprint generative ﬂow for raw audio, which
is directly trained with maximum likelihood. It
handles the long-range structure of 1-D wave-
form with a dilated 2-D convolutional architec-
ture, while modeling the local variations using
expressive autoregressive functions. WaveFlow
provides a uniﬁed view of likelihood-based mod-
els for 1-D data, including WaveNet and Wave-
Glow as special cases. It generates high-ﬁdelity
speech as WaveNet, while synthesizing several
orders of magnitude faster as it only requires a
few sequential steps to generate very long wave-
forms with hundreds of thousands of time-steps.
Furthermore, it can signiﬁcantly reduce the likeli-
hood gap that has existed between autoregressive
models and ﬂow-based models for efﬁcient syn-
thesis. Finally, our small-footprint WaveFlow has
only 5.91M parameters, which is 15× smaller
than WaveGlow. It can generate 22.05 kHz high-
ﬁdelity audio 42.6× faster than real-time (at a rate
of 939.3 kHz) on a V100 GPU without engineered
inference kernels.

Abstract printed

Note: to get good synthesis quality, the hyphens at line breaks in the paper have to be handled by hand. The final result can be seen in abstract.txt.
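Part of that manual work can be automated. The sketch below merges words that were split across a line end and unwraps the remaining line breaks; the function name join_hyphenated_lines is hypothetical, and the rule is a plain heuristic rather than the procedure the author used.

```python
import re


def join_hyphenated_lines(text):
    """Merge words split across line ends and unwrap the remaining line breaks."""
    # "wave-\nform" -> "waveform": drop a hyphen that ends a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Every remaining line break (including blank lines) becomes a single space.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


raw = (
    "handles the long-range structure of 1-D wave-\n"
    "form with a dilated 2-D convolutional architec-\n"
    "ture, while modeling the local variations"
)
print(join_hyphenated_lines(raw))
```

The heuristic cannot tell a line-break hyphen from a real one: "small-" followed by "footprint" on the next line would become "smallfootprint", so compound words that happen to wrap still need a manual check.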

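OCR recognition of English sentences in images

The following part of the main() function in PaddleOCR/tools/infer/predict_system.py is modified slightly so that it outputs only the recognized text, which is easier to read: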

drop_score = 0.5
dt_num = len(dt_boxes)
for dno in range(dt_num):
    text, score = rec_res[dno]
    if score >= drop_score:
        # Print only the text and save it to a txt file
        # text_str = "%s, %.3f" % (text, score)
        with open('../ocr_text.txt', 'a') as f:
            text_str = "%s" % (text)
            f.write(text_str + "\n")
        print(text_str)
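The filtering idea in this snippet can be isolated into a small helper. This is a sketch with made-up data: keep_confident_text and sample_rec_res are hypothetical names, and the (text, score) pairs only mimic the shape of rec_res used above.

```python
def keep_confident_text(rec_res, drop_score=0.5):
    """Return the recognized strings whose confidence reaches drop_score."""
    return [text for text, score in rec_res if score >= drop_score]


# Hypothetical recognition results in the (text, score) shape used above.
sample_rec_res = [
    ("Knowledge is power", 0.98),
    ("Francis Bacon", 0.91),
    ("q~", 0.21),
]
lines = keep_confident_text(sample_rec_res)
print("\n".join(lines))
```

Collecting the lines first and writing them in one step with mode 'w' avoids the duplicate entries that append mode produces when the cell is run more than once.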

%cd /home/aistudio/PaddleOCR

/home/aistudio/PaddleOCR

# Find some images of English quotations
# ("…" marks a part of a URL that is missing from the original listing)
!wget https://quotefancy.com/media/wallpaper/3840x2160/50594-Francis-Bacon-Quote-Knowledge-is-power.jpg --no-check-certificate
!wget https://www.quotemaster.org/images/2423b4151b7283c4570e2967fbf022cf.jpg
!wget https://www.promptaconsultinggroup.com/wp-content/uploads/2018/Focus-…-Results.jpg
!wget https://quotefancy.com/media/wallpaper/1600x900/50583-Francis-Bacon-Quote-Knowledge-is-power.jpg --no-check-certificate
!wget https://quotefancy.com/media/wallpaper/3840x2160/2347129-William-Shakespeare-Quote-To-be-or-not-to-be-that-is-the-question.jpg --no-check-certificate

!python tools/infer/predict_system.py \
    --image_dir="50594-Francis-Bacon-Quote-Knowledge-is-power.jpg" \
    --det_model_dir="./inference/ch_det_r50_vd_db/" \
    --rec_model_dir="./inference/ch_rec_r34_vd_crnn_enhance/" \
    --use_space_char=True

dt_boxes num : …, elapse : 0.02082991600036621
rec_res num : …, elapse : 0.019023895263671875
Predict time of 50594-Francis-Bacon-Quote-Knowledge-is-power.jpg: 0.097
Knowledge is power
Francis Bacon
quotefancy
The visualized image saved in ./inference_results/50594-Francis-Bacon-Quote-Knowledge-is-power.jpg

OCR text recognition result:

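Step 3: Text to speech

This step modifies the example script Parakeet/examples/fastspeech/synthesis.py. The key change is to replace the test that synthesizes one given sentence with a loop that reads a txt file line by line and generates speech for each line. The modified synthesis() function is shown below; see the synthesis.py file for the complete changes.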

def synthesis(args):
    local_rank = dg.parallel.Env().local_rank
    place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
    fluid.enable_dygraph(place)

    with open(args.config) as f:
        cfg = yaml.load(f, Loader=yaml.Loader)

    if not os.path.exists(args.output):
        os.mkdir(args.output)

    writer = SummaryWriter(os.path.join(args.output, 'log'))

    model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
    # Load parameters.
    global_step = io.load_parameters(
        model=model, checkpoint_path=args.checkpoint)
    model.eval()

    # Read the txt file line by line and generate speech for each line
    for i, line in enumerate(open(args.text_input)):
        text_input = line
        text = np.asarray(text_to_sequence(text_input))
        text = np.expand_dims(text, axis=0)
        pos_text = np.arange(1, text.shape[1] + 1)
        pos_text = np.expand_dims(pos_text, axis=0)

        text = dg.to_variable(text).astype(np.int64)
        pos_text = dg.to_variable(pos_text).astype(np.int64)

        _, mel_output_postnet = model(text, pos_text, alpha=args.alpha)

        if args.vocoder == 'griffin-lim':
            # synthesis use griffin-lim
            wav = synthesis_with_griffinlim(mel_output_postnet, cfg['audio'])
        elif args.vocoder == 'waveflow':
            wav = synthesis_with_waveflow(mel_output_postnet, args,
                                          args.checkpoint_vocoder, place)
        else:
            print(
                'vocoder error, we only support griffinlim and waveflow, but received %s.'
                % args.vocoder)

        writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0,
                         cfg['audio']['sr'])
        if not os.path.exists(os.path.join(args.output, 'samples')):
            os.mkdir(os.path.join(args.output, 'samples'))
        # One wav file per input line: <vocoder><line index>.wav
        write(
            os.path.join(
                os.path.join(args.output, 'samples'), args.vocoder + str(i) + '.wav'),
            cfg['audio']['sr'], wav)

    print("Synthesis completed !!!")
    writer.close()
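The per-line loop passes every line of the file to the model, including blank ones, and never closes the input file. A small standard-library sketch of the same bookkeeping, kept separate from the model code, might look like this; iter_sentences and sample_path are hypothetical helpers and are not part of Parakeet.

```python
import os
import tempfile


def iter_sentences(path):
    """Yield (index, sentence) for each non-empty line of a text file."""
    with open(path, encoding="utf-8") as f:
        index = 0
        for line in f:
            sentence = line.strip()
            if sentence:
                yield index, sentence
                index += 1


def sample_path(output_dir, vocoder, index):
    """Build the wav path for one sentence, e.g. <output_dir>/samples/waveflow0.wav."""
    return os.path.join(output_dir, "samples", "%s%d.wav" % (vocoder, index))


# Small self-check with a temporary file standing in for article3.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.txt")
    with open(demo, "w", encoding="utf-8") as f:
        f.write("First sentence.\n\nSecond sentence.\n")
    pairs = list(iter_sentences(demo))
print(pairs)
```

Because blank lines are skipped here, the indices can differ from those of a plain enumerate over the file. Whatever numbering is used for synthesis must also be used when the ffmpeg list file is generated.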

%env CUDA_VISIBLE_DEVICES=…

env: CUDA_VISIBLE_DEVICES=…

%cd /home/aistudio/Parakeet/examples/fastspeech

/home/aistudio/Parakeet/examples/fastspeech

Reading the HTML article with WaveFlow as the vocoder

!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/article3.txt'

{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/step-2000000',
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml',
 'output': './synthesis',
 'text_input': '/home/aistudio/article3.txt',
 'use_gpu': 1,
 'vocoder': 'waveflow'}

Verify the text-to-speech result

The generated TTS audio is saved in the Parakeet/examples/fastspeech/synthesis/samples folder; pick a few clips to check the result.

import IPython
IPython.display.Audio('synthesis/samples/waveflow3.wav')

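Merge the generated audio files with ffmpeg

Because the audio files were generated by scanning the text line by line, hearing the complete article requires concatenating the generated files in order.

Before concatenating with ffmpeg, prepare a list.txt file in the following format:

file 'path/to/file1'

file 'path/to/file2'

file 'path/to/file3'

Then run ffmpeg -f concat -i list.txt -c copy "outputfile" to complete the concatenation.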
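The list file can be generated from Python. The sketch below is one way to do it, assuming the per-line wav names produced by the modified synthesis.py; write_concat_list is a hypothetical helper name. It opens the list in 'w' mode so that rerunning the cell does not append duplicate entries, and it quotes each path in the form the concat demuxer expects.

```python
import os
import tempfile


def write_concat_list(wav_paths, list_path):
    """Write an ffmpeg concat-demuxer list, one "file '<path>'" line per input."""
    # Mode "w" so that reruns do not append duplicate entries.
    with open(list_path, "w", encoding="utf-8") as f:
        for path in wav_paths:
            # A single quote inside a quoted path is written as '\''.
            escaped = path.replace("'", "'\\''")
            f.write("file '%s'\n" % escaped)


wavs = ["synthesis/samples/waveflow%d.wav" % i for i in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    list_file = os.path.join(tmp, "list.txt")
    write_concat_list(wavs, list_file)
    with open(list_file, encoding="utf-8") as f:
        content = f.read()
print(content)
```

With -c copy the streams are copied without re-encoding, so all inputs need the same sample rate, channel count, and sample format. That holds here because every clip comes from the same vocoder and configuration.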

# Generate the list file
for i, line in enumerate(open('/home/aistudio/article3.txt')):
    with open('waveflow_article3.txt', 'a') as f:
        result = 'file synthesis/samples/waveflow' + str(i) + '.wav'
        f.write(result + "\n")

# Concatenate the audio
!ffmpeg -f concat -i waveflow_article3.txt -c copy 'waveflow_article3.wav'

Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, concat, from 'waveflow_article3.txt':
  Duration: N/A, start: 0.000000, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, 1 channels, flt, 705 kb/s
Output #0, wav, to 'waveflow_article3.wav':
  Metadata:
    ISFT : Lavf56.40.101
    Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, mono, 705 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
size= 16235kB time=00:03:08.49 bitrate= 705.6kbits/s
video:0kB audio:16235kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000686%

Reading the HTML article with the Griffin-Lim algorithm as the vocoder

!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --text_input='/home/aistudio/article3.txt'

{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': None,
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': None,
 'output': './synthesis',
 'text_input': '/home/aistudio/article3.txt',
 'use_gpu': 1,
 'vocoder': 'griffin-lim'}

Verify the text-to-speech result

import IPython
IPython.display.Audio('synthesis/samples/griffin-lim3.wav')

Merge the generated audio files with ffmpeg

# Generate the list file
for i, line in enumerate(open('/home/aistudio/article3.txt')):
    with open('griffin-lim_article3.txt', 'a') as f:
        result = 'file synthesis/samples/griffin-lim' + str(i) + '.wav'
        f.write(result + "\n")

# Concatenate the audio
!ffmpeg -f concat -i griffin-lim_article3.txt -c copy 'griffin-lim_article3.wav'

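Text-to-speech results for the paper abstract and the OCR text

The TTS process for abstract.txt and ocr_text.txt is exactly the same as for article3.txt above. The only difference is that the audio file synthesized from the OCR text is small enough to play directly in the Notebook.

1. Paper abstract TTS:

!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/abstract.txt'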

# Generate the list file
for i, line in enumerate(open('/home/aistudio/abstract.txt')):
    with open('waveflow_abstract.txt', 'a') as f:
        result = 'file synthesis/samples/waveflow' + str(i) + '.wav'
        f.write(result + "\n")

# Concatenate the audio
!ffmpeg -f concat -i waveflow_abstract.txt -c copy 'waveflow_abstract.wav'

2. OCR recognition TTS (Knowledge is Power)

Note: ocr_text.txt contains very little text, so it has been tidied by hand into a single line.

!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/ocr_text.txt'

{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/step-2000000',
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml',
 'output': './synthesis',
 'text_input': '/home/aistudio/ocr_text.txt',
 'use_gpu': 1,
 'vocoder': 'waveflow'}

[checkpoint] Rank 0: loaded model from ./fastspeech_ljspeech_ckpt_1.0/step-162000.pdparams
[checkpoint] Rank 0: loaded model from ./waveflow_res128_ljspeech_ckpt_1.0/step-2000000.pdparams
Synthesis completed !!!

!mv synthesis/samples/waveflow0.wav ./ocr.wav

import IPython
IPython.display.Audio('ocr.wav')

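Summary: how can the TTS quality be improved further?

1. Find a better way to lay out the text automatically. This project uses Python to partly process the articles parsed from HTML and PDF, but the final layout adjustments were still made by hand, and only then was the TTS quality reasonably good. Regular expressions and other NLP techniques are needed to improve automatic layout (this is presumably a hard problem across the industry; the latest Edge browser, for example, also has layout problems).

2. The Parakeet pretrained models were trained only on the LJSpeech dataset. Continuing training with more speech datasets could yield richer speaking styles and more accurate pronunciation. For the training process, see Parakeet: a step-by-step guide to training a speech synthesis model (script task, Notebook).

3. The English recognition of the pretrained models provided with PaddleOCR can be improved further, for example by training PaddleOCR on more English OCR datasets. (To be updated.)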
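As one small example of the regular-expression cleanup mentioned in the first point, the sketch below removes URLs and collapses whitespace before a line is sent to TTS, since a URL read aloud character by character adds little. The name clean_for_tts and the chosen rules are assumptions for illustration, not part of the project.

```python
import re

_URL = re.compile(r"https?://\S+")
_SPACES = re.compile(r"\s+")


def clean_for_tts(line):
    """Drop URLs and collapse whitespace so the line reads naturally aloud."""
    line = _URL.sub("", line)
    line = _SPACES.sub(" ", line)
    # Trim spaces and the colon that typically introduced the removed URL.
    return line.strip(" :")


print(clean_for_tts("Audio samples are in: https://waveflow-demo.github.io/"))
print(clean_for_tts("Our  paper will be   presented at ICML 2020."))
```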

More resources

The complete project, including the code and text files, is public on AI Studio. Forks are welcome.

https://aistudio.baidu.com/aistudio/projectdetail/676162

To learn more about PaddlePaddle, see the following resources.

Parakeet project:

https://github.com/PaddlePaddle/Parakeet

PaddleOCR project:

GitHub:

https://github.com/PaddlePaddle/PaddleOCR

Gitee:

https://gitee.com/paddlepaddle/PaddleOCR
