[Bug]: Misalignment of Labels and Incorrect Density Values in ax.hist() for Categorical Variables #28029

Open
@FrancescMartiEscofetQC

Bug summary

I have run into an issue when calling ax.hist() on categorical variables with density=True. It shows up as two separate behaviors:

  1. Incorrect density computation: When a categorical variable is passed, the function internally converts it to integers (seen here). The resulting bin width is therefore not necessarily 1, which skews the density values. The underlying np.histogram function, which is used internally, scales the heights by the bin width so that the total area equals 1. For categorical data, however, we would expect the sum of the bar heights to be 1.
  2. Misalignment of labels: The function does not place the labels exactly at the centers of the bars. To align the x-ticks with the centers, one could in principle use the bins returned by the function, but it is hard to work out how the 'category to int' conversion (mentioned above) was performed. From the code here, it seems that the elements of the first array (when there are multiple arrays) are converted in the order in which they appear, and the resulting map is then used to convert the elements of the second array. Any new categories are assigned the next free integer. Unfortunately, this process is not documented.
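
The arithmetic behind the first point can be reproduced with NumPy alone. This is a minimal sketch that assumes the four categories are mapped to the integer codes 0 to 3, which is consistent with the bin edges reported under "Actual outcome"; the variable names are illustrative only.

```python
import numpy as np

# Assumption: "a", "b", "c", "d" are converted to the integer codes 0, 1, 2, 3.
codes = np.array([0, 1, 2, 3])

# Four equal-width bins over [0, 3] give a bin width of 0.75, not 1.
heights, edges = np.histogram(codes, bins=4, density=True)
widths = np.diff(edges)

# density=True normalizes the total area, so heights * widths sums to 1 ...
area = float(np.sum(heights * widths))
# ... while the heights themselves sum to 1 / 0.75.
height_sum = float(np.sum(heights))

print(edges)       # bin edges
print(heights)     # per-bin density
print(area, height_sum)
```

Each height is 1 / (4 * 0.75), so the bars integrate to 1 but their heights add up to roughly 1.33.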

Code for reproduction

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(["a","b","c","d"], bins=4, density=True)

Actual outcome

(array([0.33333333, 0.33333333, 0.33333333, 0.33333333]), array([0. , 0.75, 1.5 , 2.25, 3. ]), <BarContainer object of 4 artists>)

Expected outcome

When density=True is passed, I would expect the bar heights to sum to 1, because the data are categorical and the bar width should not enter the calculation. It would also be helpful to know how the categories are converted to integers, so that the labels can be placed correctly in the plot.
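
One way to picture the expected result is to use unit-width bins centered on the integer codes. The sketch below again assumes the codes 0 to 3 and works on np.histogram directly; it is not a description of what ax.hist does internally, and `unit_edges` is a name introduced here for illustration.

```python
import numpy as np

# Assumption: the categories are mapped to the integer codes 0..3.
codes = np.array([0, 1, 2, 3])
n_categories = 4

# Edges at -0.5, 0.5, ..., 3.5: every bin has width 1 and is centered on a code.
unit_edges = np.arange(n_categories + 1) - 0.5
unit_heights, _ = np.histogram(codes, bins=unit_edges, density=True)

# Bin centers coincide with the integer codes, so ticks placed there are centered.
centers = (unit_edges[:-1] + unit_edges[1:]) / 2

print(unit_heights, float(unit_heights.sum()), centers)
```

With a bin width of 1, area and height coincide, so the heights sum to 1 and each bar sits directly over its code.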

Additional information

Proposed improvements to the above behaviors:

  • It may be beneficial to reconsider how the function calculates densities when managing categorical data, explicitly setting the bin width to 1.
  • The function could be enhanced to bijectively handle the 'category to int' conversion and provide clear documentation to make it more trustworthy.
  • A solution needs to be implemented to ensure labels are correctly positioned at the center of the bars.
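
Until something along these lines exists, a possible workaround is to compute the proportions by hand and draw them as a bar chart. The helper below is hypothetical (`category_proportions` is not part of Matplotlib) and only sketches the counting step using the standard library.

```python
from collections import Counter


def category_proportions(values):
    """Return (labels, heights) with heights summing to 1.

    Hypothetical helper, not a Matplotlib API. Labels are kept in order of
    first appearance, which Counter preserves on Python 3.7+.
    """
    counts = Counter(values)
    total = sum(counts.values())
    labels = list(counts)
    heights = [counts[label] / total for label in labels]
    return labels, heights


labels, heights = category_proportions(["a", "b", "b", "c", "d"])
print(labels, heights)
```

Passing the two lists to ax.bar(labels, heights) draws one bar per category, centered on its tick label, without relying on the integer conversion at all.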

Operating system

OS/X

Matplotlib Version

3.8.3

Matplotlib Backend

MacOSX

Python version

3.10.14

Jupyter version

No response

Installation

conda
