Get the DPI of an image extracted from a PDF

I’m using iTextSharp to extract images from PDF files.

I used this code as a basis: https://psycodedeveloper.wordpress.com/2013/01/10/how-to-extract-images-from-pdf-files-using-c-and-itextsharp/

Here is my version modified to support files in memory instead of having to work with files on disk:

    /// <summary>Helper class to extract images from a PDF file. Works with the most
    /// common image types embedded in PDF files, as far as I can tell.</summary>
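    /// <example>
    /// Usage example:
    /// <code>
    /// foreach (var filename in Directory.GetFiles(searchPath, "*.pdf", SearchOption.TopDirectoryOnly))
    /// {
    ///    // The extractor works on in-memory PDFs, so read the file into a byte array first.
    ///    var images = PdfImageExtractor.ExtractImages(File.ReadAllBytes(filename));
    ///    var directory = Path.GetDirectoryName(filename);
    ///    var index = 1;
    ///
    ///    foreach (var image in images)
    ///    {
    ///       // Suggested name: PDF file name without extension plus a running image index.
    ///       var name = string.Format("{0}_{1}.png", Path.GetFileNameWithoutExtension(filename), index++);
    ///       image.Save(Path.Combine(directory, name), System.Drawing.Imaging.ImageFormat.Png);
    ///    }
    ///  }
    /// </code></example>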
    public static class PdfImageExtractor
    {
        #region Methods
        #region Public Methods

        /// <summary>Checks whether a specified page of a PDF file contains images.</summary>
        /// <returns>True if the page contains at least one image; false otherwise.</returns>
        public static bool PageContainsImages(byte[] pdfFile, int pageNumber)
        {
            using (var reader = new PdfReader(pdfFile))
            {
                var parser = new PdfReaderContentParser(reader);
                ImageRenderListener listener = null;
                parser.ProcessContent(pageNumber, (listener = new ImageRenderListener()));
                return listener.Images.Count > 0;
            }
        }

        /// <summary>Extracts all images (of types that iTextSharp knows how to decode) from a PDF file.</summary>
        public static List<System.Drawing.Image> ExtractImages(byte[] pdfFile)
        {
            var images = new List<System.Drawing.Image>();

            using (var reader = new PdfReader(pdfFile))
            {
                var parser = new PdfReaderContentParser(reader);
                ImageRenderListener listener = null;
                for (var i = 1; i <= reader.NumberOfPages; i++)
                {
                    parser.ProcessContent(i, (listener = new ImageRenderListener()));
                    var index = 1;
                    if (listener.Images.Count > 0)
                    {
                        Console.WriteLine("Found {0} images on page {1}.", listener.Images.Count, i);
                        foreach (var pair in listener.Images)
                        {
                            images.Add(pair);
                            index++;
                        }
                    }
                }
                return images;
            }
        }

        /// <summary>Extracts all images (of types that iTextSharp knows how to decode)
        /// from a specified page of a PDF file.</summary>
        /// <returns>Returns a generic <see cref="List&lt;System.Drawing.Image&gt;"/>
        /// containing the images found on the page, in the order in which they are rendered.</returns>
        public static List<System.Drawing.Image> ExtractImages(byte[] pdfFile, int pageNumber)
        {
            var images = new List<System.Drawing.Image>();
            using (var reader = new PdfReader(pdfFile))
            {
                var parser = new PdfReaderContentParser(reader);
                ImageRenderListener listener = null;

                parser.ProcessContent(pageNumber, (listener = new ImageRenderListener()));
                int index = 1;
                if (listener.Images.Count > 0)
                {
                    Console.WriteLine("Found {0} images on page {1}.", listener.Images.Count, pageNumber);
                    foreach (System.Drawing.Image image in listener.Images)
                    {
                        images.Add(image);
                        index++;
                    }
                }

            }


            return images;
        }
        #endregion Public Methods
        #endregion Methods
    }
    internal class ImageRenderListener : IRenderListener
    {
        #region Fields
        List<System.Drawing.Image> images = new List<System.Drawing.Image>();
        #endregion Fields

        #region Properties
        public List<System.Drawing.Image> Images
        {
            get { return images; }
        }
        #endregion Properties

        #region Methods

        #region Public Methods
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo)
        {
            PdfImageObject image = renderInfo.GetImage();

            var imageBytes = image.GetImageAsBytes();
            var bytesType = image.GetImageBytesType();

            var fileExtension = bytesType.FileExtension;

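            // GDI+ needs the stream to stay open for the lifetime of the image,
            // so it is intentionally not wrapped in a using block.
            var memoryStream = new MemoryStream(imageBytes);
            var drawingImage = System.Drawing.Image.FromStream(memoryStream);

            // Always 96 here, which is the problem described below.
            var dpiX = drawingImage.HorizontalResolution;

            this.Images.Add(drawingImage);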
        }
        public void RenderText(TextRenderInfo renderInfo) { }
        #endregion Public Methods
        #endregion Methods
    }

The problem is that in the RenderImage method the value of dpiX is always 96. I’d like to get the original DPI of the image. Is there a way to do this?

Calling the method is enough:

    var pageImages = PdfImageExtractor.ExtractImages(fileBytes, pageNumber);

For each image, I also tried converting it to a Bitmap, but got the same result:

    var bitmapImage = new Bitmap(pageImage);
    var dpiX = bitmapImage.HorizontalResolution;
    var dpiY = bitmapImage.VerticalResolution;

I do not want to set a resolution myself; I would like to get the original image resolution.

  • The original image resolution was lost when it was applied to the PDF. The resolution is an arbitrary value used to convert pixels to centimeters (or another unit) and is not stored in the PDF. The DPI in a PDF can be anything, depending on the scale used when the image is applied.

  • Thanks for the comment. Is there a source for the statement that "the original image resolution was lost when it was applied to the pdf"? Can the resolution be recalculated somehow in that case?

  • If the PdfImageObject doesn’t have that information, you won’t be able to get it.

  • PdfImageObject indeed has no property that exposes this information directly, but I imagined there could be some indirect way to get it. So far my research has turned up nothing of the kind, so I came here for help. If @Paulosoares’ comment has a source I can rely on, then I consider the matter partially closed, and my search will only be for a way to calculate the value.

  • It depends on the PDF itself. The PDF may have passed through compressors that reduce the image resolution; in that case the original PDF would always be needed, and there is no way to tell whether a PDF has been "compressed" or not.

  • Resolution can be recalculated when the PDF is generated. This depends on how and where the PDF was generated.
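
To make the point in the comments above concrete: a PDF stores the pixel data and a transformation matrix that says how large the image is painted on the page, so the DPI that can be recomputed is the effective one, that is, pixels divided by painted size in inches. The following is only a minimal sketch of that arithmetic, not something provided by iTextSharp. The class name `EffectiveDpi` and its `Compute` method are hypothetical, and the sketch assumes the default PDF user space of 72 units per inch (no `/UserUnit` entry on the page).

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: computes the effective DPI of an
// image from its pixel size and the matrix that places it on the page.
public static class EffectiveDpi
{
    // Assumption: default PDF user space, 72 units (points) per inch.
    private const double PointsPerInch = 72.0;

    // a, b, c, d are the first four entries of the image matrix [a b 0; c d 0; e f 1].
    // The matrix maps the image's unit square onto the page, so the painted width is
    // the length of the vector (a, b) and the painted height the length of (c, d).
    // This also holds when the image is rotated on the page.
    public static Tuple<double, double> Compute(
        int pixelWidth, int pixelHeight, double a, double b, double c, double d)
    {
        double widthInPoints = Math.Sqrt(a * a + b * b);
        double heightInPoints = Math.Sqrt(c * c + d * d);

        if (widthInPoints == 0 || heightInPoints == 0)
            throw new ArgumentException("The image matrix is degenerate.");

        // Pixels per inch = pixels / (points / 72).
        double dpiX = pixelWidth * PointsPerInch / widthInPoints;
        double dpiY = pixelHeight * PointsPerInch / heightInPoints;
        return Tuple.Create(dpiX, dpiY);
    }
}
```

In `RenderImage`, the matrix entries could presumably be taken from `renderInfo.GetImageCTM()` (in iTextSharp 5.x, as far as I recall, via `ctm[Matrix.I11]`, `ctm[Matrix.I12]`, `ctm[Matrix.I21]` and `ctm[Matrix.I22]`; check this against the version in use) and the pixel size from `drawingImage.Width` and `drawingImage.Height`. For example, `EffectiveDpi.Compute(1200, 900, 288, 0, 0, 216)` describes a 1200 x 900 pixel image painted 4 x 3 inches large and works out to 300 DPI in both directions. This is the resolution at which the image is shown in this particular PDF, which is not necessarily the DPI value that was stored in the original image file.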

No answers
