Drop Columns Programmatically Based on Column Values


  • 6 votes

    For wide data sets, it can be quite a bit of work to go through every
    column and check whether it contains the data values that would make me
    want to drop it. Doing this programmatically would be a great plus for a
    data scientist. For example, I have 200-plus columns, and say half of
    them are either very sparsely populated or contain similar invalid
    values. I want to drop every column where 50% or more of the records
    have such an invalid value.
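
To make the request concrete, here is a rough sketch in plain Python of the behavior being asked for. It is not the product's API. The table is assumed to be a dict mapping column names to lists of values, and the names `drop_mostly_invalid`, `invalid_values`, and `threshold` are made up for illustration.

```python
def drop_mostly_invalid(columns, invalid_values, threshold=0.5):
    """Return a new table without the columns whose share of
    missing/invalid values is at or above `threshold`."""
    kept = {}
    for name, values in columns.items():
        if not values:
            # An empty column has no records to judge, so keep it.
            kept[name] = values
            continue
        # Count None plus anything listed as invalid by the caller.
        bad = sum(1 for v in values if v is None or v in invalid_values)
        if bad / len(values) < threshold:
            kept[name] = values
    return kept


# Hypothetical sample data: "score" is 75% invalid, "city" is 25% invalid.
table = {
    "id": [1, 2, 3, 4],
    "score": [None, "N/A", 7, None],
    "city": ["Oslo", "", "Rome", "Lima"],
}
cleaned = drop_mostly_invalid(table, invalid_values={"", "N/A"})
print(list(cleaned))
```

With the sample above, only `score` crosses the 50% line and is dropped. The threshold and the set of invalid markers are parameters, so the same rule applies to all 200 columns in one pass.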

    Under Review 2 Replies Ideas and Product Feedback August 5, 2017
    Dfariaf July 12, 2016 12:35 PM

    The same applies to replacing values in several columns. If I have 200 columns where I need to replace all blank values with 0, for instance, would I have to go through them one by one to perform this action?
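
As an illustration of the bulk replacement described here, the following is a minimal plain-Python sketch, again assuming the table is a dict of column name to list of values. The helpers `is_blank` and `fill_blanks` are hypothetical names, and treating whitespace-only strings as blank is an assumption about what "blank" should mean.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only strings as blank (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def fill_blanks(columns, fill_value=0):
    """Return a new table with every blank cell replaced by `fill_value`."""
    return {
        name: [fill_value if is_blank(v) else v for v in values]
        for name, values in columns.items()
    }


# Hypothetical sample data with blanks spread across two columns.
table = {"q1": [5, "", None], "q2": ["  ", 3, 8]}
filled = fill_blanks(table)
print(filled)
```

The point is that one rule is applied to every column at once instead of being repeated per column.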

    ryan.pipkin@tekcomms.com October 26, 2015 10:57 PM

    Another example is when a column is either a duplicate or a computed version of the other columns. In general I would just blindly drop them, but in some cases it is a nice sanity check that the value is truly a duplicate of the other column or computes to the appropriate value from the source columns. If I find one that does not validate I want to inspect the data before proceeding.
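
The sanity check described here could look roughly like the sketch below, written in plain Python under the same assumption that the table is a dict of column name to list of values. The function `mismatched_rows`, the `checks` mapping, and the sample columns are all invented for illustration, and values are compared with exact equality (floating-point data would likely need a tolerance).

```python
def mismatched_rows(columns, candidate, compute):
    """Return the row indices where `candidate` differs from the
    value that `compute` derives from the rest of the row."""
    bad = []
    for i in range(len(columns[candidate])):
        row = {name: values[i] for name, values in columns.items()}
        if row[candidate] != compute(row):
            bad.append(i)
    return bad


# Hypothetical sample data: "total" should equal price * qty,
# and "price_copy" should duplicate "price".
table = {
    "price": [10, 20, 30],
    "qty": [1, 2, 3],
    "total": [10, 40, 95],
    "price_copy": [10, 20, 30],
}
checks = {
    "total": lambda r: r["price"] * r["qty"],
    "price_copy": lambda r: r["price"],
}

for name, rule in checks.items():
    bad = mismatched_rows(table, name, rule)
    if bad:
        # Keep the column so the offending rows can be inspected first.
        print(f"{name}: inspect rows {bad} before dropping")
    else:
        # Every row validates, so the redundant column is safe to drop.
        del table[name]
```

In this sample, `price_copy` validates and is dropped, while `total` is kept because its last row does not match `price * qty`.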