python - Quick way to extract CSS style attributes from HTML elements
For machine learning purposes, I have an HTML page as input and need to extract the style attributes of its DOM elements. Here is my preliminary code:

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    start = time.time()
    driver = webdriver.PhantomJS()
    driver.get('example page')
    # Select leaf nodes (elements without child elements)
    elements = driver.find_elements(By.XPATH, "//*[not(child::*)]")
    l = {}
    css_properties = ("line-height", "text-align", "font-size", "font-style")
    for i in elements:
        if i.text:
            # print time.time() - end_dl
            if i.text not in l:
                l[i.text] = {}
            # Each property lookup is a separate call to the browser
            for el in css_properties:
                l[i.text][el] = str(i.value_of_css_property(el))
            l[i.text]["text_length"] = len(i.text)

The problem is that this code takes a long time to parse the features (~8 s). Can you think of a faster way to do this?
Are you sure it's the parsing step that's taking so long?
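One way to check is to time each step separately. The following is only a sketch: the `timed` helper and the step names are made up for illustration, and the `time.sleep` calls stand in for the real page load, element lookup, and property loop from the question.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> elapsed seconds


@contextmanager
def timed(step):
    """Add the wall-clock time of the wrapped block to timings[step]."""
    begin = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = timings.get(step, 0.0) + time.perf_counter() - begin


# Stand-ins for the real steps. In the question's script these would wrap
# driver.get(...), driver.find_elements(...), and the CSS property loop.
with timed("page load"):
    time.sleep(0.02)
with timed("find leaf nodes"):
    time.sleep(0.01)
with timed("read CSS properties"):
    time.sleep(0.03)

# Print the slowest step first
for step, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{step}: {seconds:.3f}s")
```

If most of the time turns out to be spent starting the browser or loading the page, speeding up the extraction loop will not help much.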
If so, here are a few options...
- Try BeautifulSoup4 for parsing the DOM.
- Deploy on a cloud server with faster hardware. Use Amazon EC2 or DigitalOcean, which charge by the hour.
- Deploy on a distributed system.
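For the BeautifulSoup4 option, here is a rough sketch of what the extraction could look like. Note the limitation: BeautifulSoup only parses the markup, so it sees inline `style="..."` attributes but not styles computed from stylesheets, which is what `value_of_css_property` returns. The sample HTML and the helper names `parse_inline_style` and `extract_features` are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

CSS_PROPERTIES = ("line-height", "text-align", "font-size", "font-style")


def parse_inline_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (simple split, no CSS parser)."""
    declarations = {}
    for part in style.split(";"):
        name, sep, value = part.partition(":")
        if sep:
            declarations[name.strip().lower()] = value.strip()
    return declarations


def extract_features(html):
    """Map the text of each leaf element to its inline style properties."""
    soup = BeautifulSoup(html, "html.parser")
    features = {}
    for tag in soup.find_all(True):
        # Skip elements that have child elements; only leaf nodes are kept
        if tag.find(True, recursive=False) is not None:
            continue
        text = tag.get_text(strip=True)
        if not text:
            continue
        style = parse_inline_style(tag.get("style", ""))
        entry = {prop: style.get(prop) for prop in CSS_PROPERTIES}
        entry["text_length"] = len(text)
        # Keep the first element seen for each distinct text
        features.setdefault(text, entry)
    return features


# Made-up sample page for illustration
sample_html = """
<div>
  <p style="font-size: 12px; text-align: center">Hello</p>
  <span style="font-style: italic">World</span>
</div>
"""
features = extract_features(sample_html)
print(features)
```

Properties that are not set inline come back as `None` here, so this only replaces the Selenium version if the pages you care about carry their styling inline.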