Split a String into columns using regex in pandas DataFrame (original) (raw)
Last Updated : 08 Jan, 2019
Given some mixed data containing multiple values as a string, let’s see how can we divide the strings using regex and make multiple columns in Pandas DataFrame.
Method #1:
In this method we will use re.search(pattern, string, flags=0)
. Here pattern refers to the pattern that we want to search. It takes in a string with the following values:
- \w matches alphanumeric characters
- \d matches digits, which means 0-9
- \s matches whitespace characters
- \S matches non-whitespace characters
- . matches any character except the new line character \n
- * matches 0 or more instances of a pattern
import
pandas as pd
import
re
movie_data
=
[
"Name: The_Godfather Year: 1972 Rating: 9.2"
,
`` "Name: Bird_Box Year: 2018 Rating: 6.8"
,
`` "Name: Fight_Club Year: 1999 Rating: 8.8"
]
movies
=
{
"Name"
:[],
"Year"
:[],
"Rating"
:[]}
for
item
in
movie_data:
`` name_field
=
re.search(
"Name: .*"
,item)
`` if
name_field
is
not
None
:
`` name
=
re.search(
'\w*\s\w*'
,name_field.group())
`` else
:
`` name
=
None
`` movies[
"Name"
].append(name.group())
`` year_field
=
re.search(
"Year: .*"
,item)
`` if
year_field
is
not
None
:
`` year
=
re.search(
'\s\d\d\d\d'
,year_field.group())
`` else
:
`` year
=
None
`` movies[
"Year"
].append(year.group().strip())
`` rating_field
=
re.search(
"Rating: .*"
,item)
`` if
rating_field
is
not
None
:
`` rating
=
re.search(
'\s\d.\d'
,rating_field.group())
`` else
:
`` rating
-
None
`` movies[
"Rating"
].append(rating.group().strip())
df
=
pd.DataFrame(movies)
print
(df)
Output:
Explanation:
- In the code above, we use a for loop to iterate through movie data so we can work with each movie in turn. We create a dictionary, movies, that will hold all the details of each detail, such as the rating and name.
- We then find the entire Name field using the
re.search()
function. The . means any character except \n, and * extends it to the end of the line. Assign this to the variable name_field. - But, data isn’t always straightforward. It can contain surprises. For instance, what if there’s no Name: field? The script would throw an error and break. We pre-empt errors from this scenario and check for a not None case.
- Again we use the re.search() function to extract the final required string from the name_field. For the name we use \w* to represent the first word, \s to represent the space in between and \w* for the second word.
- Do the same for year and rating and get the final required dictionary.
Method #2:
To break up the string we will use Series.str.extract(pat, flags=0, expand=True)
function. Here pat refers to the pattern that we want to search for.
import
pandas as pd
dict
=
{
'movie_data'
:[
'The Godfather 1972 9.2'
,
`` 'Bird Box 2018 6.8'
,
`` 'Fight Club 1999 8.8'
] }
df
=
pd.DataFrame(
dict
)
df[
'Name'
]
=
df[
'movie_data'
].
str
.extract(
'(\w*\s\w*)'
, expand
=
True
)
df[
'Year'
]
=
df[
'movie_data'
].
str
.extract(
'(\d\d\d\d)'
, expand
=
True
)
df[
'Rating'
]
=
df[
'movie_data'
].
str
.extract(
'(\d\.\d)'
, expand
=
True
)
print
(df)
Output:
Similar Reads
- Pandas Exercises and Programs Pandas is an open-source Python Library that is made mainly for working with relational or labelled data both easily and intuitively. This Python library is built on top of the NumPy library, providing various operations and data structures for manipulating numerical data and time series. Pandas is 6 min read
- Different ways to create Pandas Dataframe It is the most commonly used Pandas object. The pd.DataFrame() function is used to create a DataFrame in Pandas. There are several ways to create a Pandas Dataframe in Python. Example: Creating a DataFrame from a Dictionary [GFGTABS] Python import pandas as pd # initialize data of lists. data = { 7 min read