
Functions that return high-level layers composed of other Kapre layers.


Functions in this module return a composed Keras layer, which is an instance of keras.Sequential or keras.Functional. These layers are not compatible with keras.models.load_model() if save_format == 'h5'. The solution is to use save_format='tf' (.pb file format). Alternatively, you can decompose the returned layer and add its sub-layers to your model manually. E.g.,

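import kapre
from tensorflow import keras

your_model = keras.Sequential()
composed_melgram_layer = kapre.composed.get_melspectrogram_layer(input_shape=(44100, 1))
for layer in composed_melgram_layer.layers:
    your_model.add(layer)  # add each sub-layer to your own model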

kapre.composed.get_stft_magnitude_layer(input_shape=None, n_fft=2048, win_length=None, hop_length=None, window_name=None, pad_begin=False, pad_end=False, return_decibel=False, db_amin=1e-05, db_ref_value=1.0, db_dynamic_range=80.0, input_data_format='default', output_data_format='default', name='stft_magnitude')

A function that returns an STFT magnitude layer. The layer is a keras.Sequential model consisting of STFT, Magnitude, and optionally MagnitudeToDecibel.

  • input_shape (None or tuple of integers) – input shape of the model. Necessary only if this layer is the first layer of your model (see keras.models.Sequential() for more details)
  • n_fft (int) – number of FFT points in STFT
  • win_length (int) – window length of STFT
  • hop_length (int) – hop length of STFT
  • window_name (str or None) – Name of tf.signal function that returns a 1D tensor window that is used in analysis. Defaults to hann_window which uses tf.signal.hann_window. Window availability depends on Tensorflow version. More details are at kapre.backend.get_window().
  • pad_begin (bool) – Whether to pad with zeros along time axis (length: win_length - hop_length). Defaults to False.
  • pad_end (bool) – whether to pad the input signal at the end in STFT.
  • return_decibel (bool) – whether to apply decibel scaling at the end
  • db_amin (float) – noise floor of decibel scaling input. See MagnitudeToDecibel for more details.
  • db_ref_value (float) – reference value of decibel scaling. See MagnitudeToDecibel for more details.
  • db_dynamic_range (float) – dynamic range of the decibel scaling result.
  • input_data_format (str) – the audio data format of the input waveform batch: 'channels_last' if it is (batch, time, channels), 'channels_first' if it is (batch, channels, time). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • output_data_format (str) – the data format of the output spectrogram: 'channels_last' if you want (batch, time, frequency, channels), 'channels_first' if you want (batch, channels, time, frequency). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • name (str) – name of the returned layer


STFT magnitude represents a linear-frequency spectrum of an audio signal and is probably the most popular choice for audio analysis in general. By using magnitude, this layer discards the phase information, which is generally regarded as largely irrelevant to human auditory perception.


For audio analysis (when the output is a tag, label, etc.), we recommend setting return_decibel=True. Decibel scaling is perceptually plausible and numerically stable (related paper: A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging). Many music, speech, and audio applications have used this log-magnitude STFT, e.g., Learning to Pinpoint Singing Voice from Weakly Labeled Examples, Joint Beat and Downbeat Tracking with Recurrent Neural Networks, and many more.

For audio processing (when the output is an audio signal), it might be better to use the STFT magnitude as it is (return_decibel=False). Example: Singing voice separation with deep U-Net convolutional networks. This is because decibel scaling clips values at the noise floor, which is irreversible. One may use log(1+X) instead of log(X) to avoid the clipping, but it is not included in Kapre at the moment.
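The irreversibility mentioned above can be illustrated in a few lines. This is a minimal sketch of a generic floor-clipped decibel scaling, not Kapre's exact implementation: `to_decibel` is a hypothetical helper, its `amin` and `ref_value` arguments only mirror the db_amin and db_ref_value defaults, and the dynamic-range clipping is left out.

```python
import math


def to_decibel(x, amin=1e-5, ref_value=1.0):
    """10 * log10 scaling with a noise floor at amin (dynamic-range clipping omitted)."""
    return 10.0 * math.log10(max(x, amin)) - 10.0 * math.log10(max(ref_value, amin))


# Two different magnitudes below the noise floor collapse to the same value,
# so the original magnitudes cannot be recovered from the decibel output.
quiet_a = to_decibel(1e-7)
quiet_b = to_decibel(1e-9)
floor_is_lossy = quiet_a == quiet_b

# log(1 + X) has no floor and is inverted by exp(Y) - 1.
restored = math.expm1(math.log1p(1e-7))
```

Here `floor_is_lossy` is True, while `restored` matches the original 1e-7 up to floating-point rounding.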


from tensorflow.keras.models import Sequential
from kapre.composed import get_stft_magnitude_layer

input_shape = (2048, 1)  # mono signal, audio is channels_last
stft_mag = get_stft_magnitude_layer(input_shape=input_shape, n_fft=1024, hop_length=512,
    return_decibel=True, input_data_format='channels_last', output_data_format='channels_first')
model = Sequential()
model.add(stft_mag)
# now the shape is (batch, ch=1, n_frame=3, n_freq=513) because output_data_format is 'channels_first'
# and the dtype is float
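
The shape quoted in the comment above can be checked by hand. The following is a minimal sketch, assuming no padding (pad_begin=False, pad_end=False), a window as long as n_fft, a hop of 512 samples, and the usual rule of one frame per full window. `stft_output_shape` is a hypothetical helper for illustration, not part of Kapre.

```python
def stft_output_shape(n_samples, n_ch, n_fft, hop_length, output_data_format):
    """Per-example output shape of an STFT magnitude layer (batch axis omitted).

    Assumes win_length == n_fft and no padding at either end.
    """
    n_frame = 1 + (n_samples - n_fft) // hop_length  # frames that fit entirely
    n_freq = n_fft // 2 + 1  # non-negative frequency bins of a real-input FFT
    if output_data_format == "channels_first":
        return (n_ch, n_frame, n_freq)
    return (n_frame, n_freq, n_ch)


shape = stft_output_shape(2048, 1, n_fft=1024, hop_length=512,
                          output_data_format="channels_first")
print(shape)  # (1, 3, 513)
```

With a smaller hop the same 2048-sample input yields more frames, so n_frame=3 holds only for a hop of 512 samples.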

kapre.composed.get_melspectrogram_layer(input_shape=None, n_fft=2048, win_length=None, hop_length=None, window_name=None, pad_begin=False, pad_end=False, sample_rate=22050, n_mels=128, mel_f_min=0.0, mel_f_max=None, mel_htk=False, mel_norm='slaney', return_decibel=False, db_amin=1e-05, db_ref_value=1.0, db_dynamic_range=80.0, input_data_format='default', output_data_format='default', name='melspectrogram')

A function that returns a melspectrogram layer, which is a keras.Sequential model consisting of STFT, Magnitude, ApplyFilterbank(_mel_filterbank), and optionally MagnitudeToDecibel.

  • input_shape (None or tuple of integers) – input shape of the model. Necessary only if this melspectrogram layer is the first layer of your model (see keras.models.Sequential() for more details)
  • n_fft (int) – number of FFT points in STFT
  • win_length (int) – window length of STFT
  • hop_length (int) – hop length of STFT
  • window_name (str or None) – Name of tf.signal function that returns a 1D tensor window that is used in analysis. Defaults to hann_window which uses tf.signal.hann_window. Window availability depends on Tensorflow version. More details are at kapre.backend.get_window().
  • pad_begin (bool) – Whether to pad with zeros along time axis (length: win_length - hop_length). Defaults to False.
  • pad_end (bool) – whether to pad the input signal at the end in STFT.
  • sample_rate (int) – sample rate of the input audio
  • n_mels (int) – number of mel bins in the mel filterbank
  • mel_f_min (float) – lowest frequency of the mel filterbank
  • mel_f_max (float) – highest frequency of the mel filterbank
  • mel_htk (bool) – whether to follow the HTK mel filterbank formula or not
  • mel_norm ('slaney' or int) – normalization policy of the mel filterbank triangles
  • return_decibel (bool) – whether to apply decibel scaling at the end
  • db_amin (float) – noise floor of decibel scaling input. See MagnitudeToDecibel for more details.
  • db_ref_value (float) – reference value of decibel scaling. See MagnitudeToDecibel for more details.
  • db_dynamic_range (float) – dynamic range of the decibel scaling result.
  • input_data_format (str) – the audio data format of the input waveform batch: 'channels_last' if it is (batch, time, channels), 'channels_first' if it is (batch, channels, time). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • output_data_format (str) – the data format of the output melspectrogram: 'channels_last' if you want (batch, time, frequency, channels), 'channels_first' if you want (batch, channels, time, frequency). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • name (str) – name of the returned layer


Melspectrogram was originally developed for speech applications and has been very widely used for audio signal analysis, including music information retrieval. As its mel axis is a non-linear compression of the (linear) frequency axis, a melspectrogram can be an efficient choice as the input of a machine learning model. We recommend setting return_decibel=True.

References: Automatic tagging using deep convolutional neural networks, Deep content-based music recommendation, CNN Architectures for Large-Scale Audio Classification, Multi-label vs. combined single-label sound event detection with deep neural networks, Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification, and way too many speech applications.
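
To make the "non-linear compression" of the frequency axis concrete, here is a minimal sketch using the widely used HTK-style mel formula, which is the variant the mel_htk parameter refers to. `hz_to_mel_htk` is a hypothetical helper for illustration and is not Kapre's filterbank code; with mel_htk=False a different variant of the scale is used.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# Equal 1 kHz steps in Hz shrink on the mel axis as frequency grows.
low_step = hz_to_mel_htk(1000.0) - hz_to_mel_htk(0.0)
high_step = hz_to_mel_htk(8000.0) - hz_to_mel_htk(7000.0)
```

The first kilohertz spans about 1000 mels, while the step from 7 kHz to 8 kHz spans far fewer, which is why a modest number of mel bins can cover the whole spectrum.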


from tensorflow.keras.models import Sequential
from kapre.composed import get_melspectrogram_layer

input_shape = (2, 2048)  # stereo signal, audio is channels_first
melgram = get_melspectrogram_layer(input_shape=input_shape, n_fft=1024, hop_length=512,
    return_decibel=True, n_mels=96, input_data_format='channels_first',
    output_data_format='channels_last')
model = Sequential()
model.add(melgram)
# now the shape is (batch, n_frame=3, n_mels=96, n_ch=2) because output_data_format is 'channels_last'
# and the dtype is float

kapre.composed.get_log_frequency_spectrogram_layer(input_shape=None, n_fft=2048, win_length=None, hop_length=None, window_name=None, pad_begin=False, pad_end=False, sample_rate=22050, log_n_bins=84, log_f_min=None, log_bins_per_octave=12, log_spread=0.125, return_decibel=False, db_amin=1e-05, db_ref_value=1.0, db_dynamic_range=80.0, input_data_format='default', output_data_format='default', name='log_frequency_spectrogram')

A function that returns a log-frequency STFT layer, which is a keras.Sequential model consisting of STFT, Magnitude, ApplyFilterbank(_log_filterbank), and optionally MagnitudeToDecibel.

  • input_shape (None or tuple of integers) – input shape of the model. Necessary only if this layer is the first layer of your model (see keras.models.Sequential() for more details)
  • n_fft (int) – number of FFT points in STFT
  • win_length (int) – window length of STFT
  • hop_length (int) – hop length of STFT
  • window_name (str or None) – Name of tf.signal function that returns a 1D tensor window that is used in analysis. Defaults to hann_window which uses tf.signal.hann_window. Window availability depends on Tensorflow version. More details are at kapre.backend.get_window().
  • pad_begin (bool) – Whether to pad with zeros along time axis (length: win_length - hop_length). Defaults to False.
  • pad_end (bool) – whether to pad the input signal at the end in STFT.
  • sample_rate (int) – sample rate of the input audio
  • log_n_bins (int) – number of the bins in the log-frequency filterbank
  • log_f_min (float) – lowest frequency of the filterbank
  • log_bins_per_octave (int) – number of bins in each octave in the filterbank
  • log_spread (float) – spread constant (Q value) in the log filterbank.
  • return_decibel (bool) – whether to apply decibel scaling at the end
  • db_amin (float) – noise floor of decibel scaling input. See MagnitudeToDecibel for more details.
  • db_ref_value (float) – reference value of decibel scaling. See MagnitudeToDecibel for more details.
  • db_dynamic_range (float) – dynamic range of the decibel scaling result.
  • input_data_format (str) – the audio data format of the input waveform batch: 'channels_last' if it is (batch, time, channels), 'channels_first' if it is (batch, channels, time). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • output_data_format (str) – the data format of the output log-frequency spectrogram: 'channels_last' if you want (batch, time, frequency, channels), 'channels_first' if you want (batch, channels, time, frequency). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • name (str) – name of the returned layer


A log-frequency spectrogram is similar to a melspectrogram, but its frequency axis is exactly linear on the octave scale, i.e., every octave spans the same number of bins. For some pitch-related applications, a log-frequency spectrogram can be a good choice.
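
As an illustration of an octave-linear axis, the sketch below lists geometrically spaced center frequencies for the default log_n_bins=84 and log_bins_per_octave=12. `log_frequency_centers` is a hypothetical helper, and the starting frequency of 32.70 Hz (about the note C1) is an assumption made for this example; it does not reproduce Kapre's filterbank construction.

```python
def log_frequency_centers(f_min, n_bins=84, bins_per_octave=12):
    """Geometrically spaced center frequencies: f_k = f_min * 2 ** (k / bins_per_octave)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


centers = log_frequency_centers(f_min=32.70)  # 32.70 Hz (about C1) is an assumed start
octaves_covered = len(centers) / 12  # 84 bins at 12 per octave span 7 octaves
```

Moving up by 12 bins doubles the frequency anywhere on the axis, so a pitch shift becomes a plain translation along the frequency axis.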


from tensorflow.keras.models import Sequential
from kapre.composed import get_log_frequency_spectrogram_layer

input_shape = (2048, 2)  # stereo signal, audio is channels_last
logfreq_stft_mag = get_log_frequency_spectrogram_layer(
    input_shape=input_shape, n_fft=1024, hop_length=512, return_decibel=True,
    log_n_bins=84, input_data_format='channels_last', output_data_format='channels_last')
model = Sequential()
model.add(logfreq_stft_mag)
# now the shape is (batch, n_frame=3, n_bins=84, n_ch=2) because output_data_format is 'channels_last'
# and the dtype is float

kapre.composed.get_perfectly_reconstructing_stft_istft(stft_input_shape=None, istft_input_shape=None, n_fft=2048, win_length=None, hop_length=None, forward_window_name=None, waveform_data_format='default', stft_data_format='default', stft_name='stft', istft_name='istft')

A function that returns two layers, STFT and inverse STFT, which form a perfectly reconstructing pair.

  • stft_input_shape (tuple) – Input shape of single waveform. Must specify this if the returned stft layer is going to be used as first layer of a Sequential model.
  • istft_input_shape (tuple) – Input shape of single STFT. Must specify this if the returned istft layer is going to be used as first layer of a Sequential model.
  • n_fft (int) – Number of FFT points. Defaults to 2048.
  • win_length (int or None) – Window length in samples. Defaults to n_fft.
  • hop_length (int or None) – Hop length in samples between analysis windows. Defaults to n_fft // 4 following librosa.
  • forward_window_name (str or None) – Name of tf.signal function that returns a 1D tensor window that is used. Defaults to hann_window which uses tf.signal.hann_window. Window availability depends on Tensorflow version. More details are at kapre.backend.get_window().
  • waveform_data_format (str) – The audio data format of the waveform batch: 'channels_last' if it is (batch, time, channels), 'channels_first' if it is (batch, channels, time). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • stft_data_format (str) – The data format of the STFT: 'channels_last' if you want (batch, time, frequency, channels), 'channels_first' if you want (batch, channels, time, frequency). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • stft_name (str) – name of the returned STFT layer
  • istft_name (str) – name of the returned ISTFT layer


Without a careful setting, tf.signal.stft and tf.signal.istft are not perfectly reconstructing.


Imagine x –> STFT –> InverseSTFT –> y. y will be longer than x due to the padding at the beginning and the end. To compare them, you need to trim y along the time axis.

The formula: with trim_begin = win_length - hop_length and len_signal being the length of x, y_trimmed = y[trim_begin: trim_begin + len_signal, :] (in the case of channels_last).
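
The trimming formula can be tried out on a toy signal. In this minimal sketch the STFT and inverse STFT are not actually run: `y` is built by hand as `x` surrounded by padding, the trailing padding length is an assumption made for illustration, and `trim_reconstruction` is a hypothetical helper, not a Kapre function. Nested lists stand in for a (time, channels) array, so the slice acts on the time axis as in the channels_last case.

```python
def trim_reconstruction(y, len_signal, win_length, hop_length):
    """Drop the leading padding from y and keep len_signal samples (time axis first)."""
    trim_begin = win_length - hop_length
    return y[trim_begin: trim_begin + len_signal]


# Toy stand-in for the STFT -> InverseSTFT output: x surrounded by padding.
x = [[float(t)] for t in range(10)]  # (time, channels) = (10, 1)
win_length, hop_length = 4, 1
pad = [[0.0]] * (win_length - hop_length)
y = pad + x + pad

y_trimmed = trim_reconstruction(y, len(x), win_length, hop_length)
```

After trimming, `y_trimmed` has the same length as `x` and can be compared with it sample by sample.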


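from tensorflow.keras.models import Sequential
from kapre.composed import get_perfectly_reconstructing_stft_istft

stft_input_shape = (2048, 2)  # stereo and channels_last
stft_layer, istft_layer = get_perfectly_reconstructing_stft_istft(
    stft_input_shape=stft_input_shape
)

unet = get_unet()  # your own model; input: stft (complex value), output: stft (complex value)

model = Sequential()
model.add(stft_layer)  # input is waveform
model.add(unet)
model.add(istft_layer)  # output is also waveform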

kapre.composed.get_stft_mag_phase(input_shape, n_fft=2048, win_length=None, hop_length=None, window_name=None, pad_begin=False, pad_end=False, return_decibel=False, db_amin=1e-05, db_ref_value=1.0, db_dynamic_range=80.0, input_data_format='default', output_data_format='default', name='stft_mag_phase')

A function that returns magnitude and phase of input audio.

  • input_shape (tuple of integers) – input shape of the STFT layer. Because the returned model is a keras.Functional model, the input shape must be specified. E.g., (44100, 2) for 44100-sample stereo audio with input_data_format == 'channels_last'.
  • n_fft (int) – number of FFT points in STFT
  • win_length (int) – window length of STFT
  • hop_length (int) – hop length of STFT
  • window_name (str or None) – Name of tf.signal function that returns a 1D tensor window that is used in analysis. Defaults to hann_window which uses tf.signal.hann_window. Window availability depends on Tensorflow version. More details are at kapre.backend.get_window().
  • pad_begin (bool) – Whether to pad with zeros along time axis (length: win_length - hop_length). Defaults to False.
  • pad_end (bool) – whether to pad the input signal at the end in STFT.
  • return_decibel (bool) – whether to apply decibel scaling at the end
  • db_amin (float) – noise floor of decibel scaling input. See MagnitudeToDecibel for more details.
  • db_ref_value (float) – reference value of decibel scaling. See MagnitudeToDecibel for more details.
  • db_dynamic_range (float) – dynamic range of the decibel scaling result.
  • input_data_format (str) – the audio data format of the input waveform batch: 'channels_last' if it is (batch, time, channels), 'channels_first' if it is (batch, channels, time). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • output_data_format (str) – the data format of the output magnitude and phase: 'channels_last' if you want (batch, time, frequency, channels), 'channels_first' if you want (batch, channels, time, frequency). Defaults to the setting of your Keras configuration (tf.keras.backend.image_data_format()).
  • name (str) – name of the returned layer


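from tensorflow.keras.models import Sequential
from kapre.composed import get_stft_mag_phase

input_shape = (2048, 3)  # 3-channel signal, channels_last
model = Sequential()
model.add(get_stft_mag_phase(input_shape=input_shape, return_decibel=True, n_fft=1024, hop_length=512))
# now output shape is (batch, n_frame=3, freq=513, ch=6). 6 channels = [3 mag ch; 3 phase ch]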
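
The channel layout in the comment above (magnitude channels followed by phase channels) can be illustrated on a single time-frequency cell. This is a minimal sketch with made-up complex values and without decibel scaling; it uses only the standard library and does not call Kapre.

```python
import cmath

# Three hypothetical complex STFT values, one per input channel, at one (frame, freq) cell.
stft_cell = [complex(3.0, 4.0), complex(0.0, -2.0), complex(-1.0, 0.0)]

magnitudes = [abs(z) for z in stft_cell]        # 3 magnitude channels
phases = [cmath.phase(z) for z in stft_cell]    # 3 phase channels, in radians
output_cell = magnitudes + phases               # 6 channels: [3 mag ch; 3 phase ch]

# Magnitude and phase together keep all the information of the complex value.
rebuilt = [cmath.rect(m, p) for m, p in zip(magnitudes, phases)]
```

Unlike a magnitude-only layer, this representation lets the complex STFT be rebuilt, as `rebuilt` shows up to floating-point rounding.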

kapre.composed.get_frequency_aware_conv2d(data_format='default', freq_aware_name='frequency_aware_conv2d', *args, **kwargs)

Returns a frequency-aware conv2d layer.

  • data_format (str) – specifies the data format of batch input/output.
  • freq_aware_name (str) – name of the returned layer
  • *args – position args for keras.layers.Conv2D.
  • **kwargs – keyword args for keras.layers.Conv2D.

A sequential model of ConcatenateFrequencyMap and Conv2D.


Koutini, K., Eghbal-zadeh, H., & Widmer, G. (2019). Receptive-Field-Regularized CNN Variants for Acoustic Scene Classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019).