Google going down?


  • Beorn
    replied
    That's almost like thinking it's alright to squeeze the date into 6 digits because you can fit that on a card...

    I'm going to find some data and get to work tomorrow...

    Leave a comment:


  • Kackle
    replied
    You don't need a calculator. An extra byte is 8 extra bits. Each bit doubles the capacity. This is too much extra capacity, but it's more efficient to add a full byte even if you need less than a byte, because the operating system works with bytes when it comes to input-output functions and memory access.

    You don't need 256 times more numbers. I'd go for 3 bits out of the 8 by masking out the ones I'm not using. That gives me 8 times 4.2 billion, or 34 billion total docIDs before I run out of numbers. That should do it.

    The other 5 bits can be used to categorize web pages or improve the algorithms some other way.
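    A minimal sketch of the masking idea above, assuming a 5-byte (40-bit) docID with the low 35 bits holding the document number and the top 5 bits left over as flags. The layout, the names, and the big-endian byte order are all illustrative assumptions, not anything Google has published.

```python
# Assumed layout of a 5-byte docID: [5 flag bits][35 document-number bits]
ID_BITS = 35                      # the original 32 bits plus 3 borrowed bits
FLAG_BITS = 5                     # what is left of the extra byte
ID_MASK = (1 << ID_BITS) - 1      # masks out everything but the document number
FLAG_MASK = (1 << FLAG_BITS) - 1

CAPACITY = 1 << ID_BITS           # 8 times the 32-bit range


def pack_docid(doc_number, flags=0):
    """Combine a document number and flag bits into 5 raw bytes."""
    if not 0 <= doc_number <= ID_MASK:
        raise ValueError("document number does not fit in 35 bits")
    if not 0 <= flags <= FLAG_MASK:
        raise ValueError("flags do not fit in 5 bits")
    value = (flags << ID_BITS) | doc_number
    return value.to_bytes(5, "big")


def unpack_docid(raw):
    """Split 5 raw bytes back into (document number, flags)."""
    value = int.from_bytes(raw, "big")
    return value & ID_MASK, value >> ID_BITS


if __name__ == "__main__":
    print(f"{CAPACITY:,} docIDs available")
    print(unpack_docid(pack_docid(123456789, flags=0b00101)))
```

    CAPACITY works out to 34,359,738,368, which is the "34 billion" figure, and the record still occupies whole bytes on disk and in memory.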

    Leave a comment:


  • Beorn
    replied
    An extra byte, in my opinion, wouldn't help for too long. I haven't done the numbers (I don't have my graphing calculator with me...but I could use the new HP one), but I suspect that web growth is modeled by an exponential regression, not a linear regression. I would go for at least 2 bytes extra.
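    A rough sketch of what exponential growth means for headroom, assuming the index doubles at some fixed interval. The 4 billion starting figure is the one used elsewhere in this thread; the doubling period is a placeholder parameter, not a measured value.

```python
import math

CURRENT_PAGES = 4_000_000_000     # figure quoted elsewhere in the thread


def doublings_left(current_pages, id_bits):
    """How many times the index can double before id_bits runs out of numbers."""
    return math.log2(2 ** id_bits / current_pages)


def years_left(current_pages, id_bits, doubling_period_years):
    """Headroom in years, given a hypothetical doubling period."""
    return doublings_left(current_pages, id_bits) * doubling_period_years


if __name__ == "__main__":
    for extra_bytes in (0, 1, 2):
        bits = 32 + 8 * extra_bytes
        headroom = doublings_left(CURRENT_PAGES, bits)
        print(f"{extra_bytes} extra byte(s): {headroom:.1f} doublings of headroom")
```

    Under this model each extra byte buys exactly eight more doublings, so how long one byte lasts depends entirely on the doubling period you plug in.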

    I don't have time to read the article, but I will later...and I'll do a bit of stats on my calculator tomorrow.

    Mike

    Leave a comment:


  • Kackle
    replied
    I didn't submit it to Slashdot. Someone else did without asking me first. I wanted Slashdot to ignore it, which is why I put that little GIF on it. I've already been slashdotted twice in the last year, and flamed by legions of slashdotting script kiddies. It's not much fun.

    I dispute your presumption. Remember, Google uses massively parallel configurations for load balancing, and for redundancy and fault tolerance, in at least nine different data centers around the world. Whatever bytes you use in the elemental design get multiplied many, many times throughout the entire architecture. It's very clear that Google's engineers are conscious of every wasted bit, and every wasted CPU cycle at the level of software design that we're discussing when referring to the nature of the docID, and the process of accessing the inverted indexes.

    From: http://www.computer.org/micro/mi2003/m2022.pdf
    "The Google Cluster Architecture: Web Search for a Planet" IEEE Micro, March-April, 2003, written by three Google engineers.

    "The search process is challenging because of the large amount of data: The raw documents comprise several tens of terabytes of uncompressed data, and the inverted index resulting from this raw data is itself many terabytes of data."

    Now multiply this many times -- heck, let's say 100 times. At that point, perhaps they have one person looking after every 100 terabytes. But you'd have to also multiply the space required for the docID in the inverted index 100 times.

    In fact, since the activity in the inverted indexes accounts for most of the CPU load, my guess is that you'd need maybe 20 duplicate inverted indexes for every complete set of raw documents. So let's multiply it by 2000 times instead.

    I'm guessing on the numbers, but my basic point remains -- they aren't going to waste bytes on the docID if they don't have to.

    Leave a comment:


  • Beorn
    replied
    Google has not had an IPO

    Kackle, that page is more than a year and a half old. It could easily be outdated. You even said it:
    Originally posted by Kackle
    It's clear that four-byte docIDs were used at one time. It's also clear that increasing beyond four bytes is not trivial. Finally, there is no way they would use 8 bytes if they can use 5 bytes instead. Remember, this docID is used twice for every indexed word on every web page on the web. You want it as efficient as possible.


    As to the /. comment. I assure you that it wasn't directed at you.
    There is rarely anything posted on /. that is not definite and set in stone. Look at today's headlines:
    • Science: Holographic Keypads Float Into View
    • Red Hat Sues SCO, Sets Up Legal Fund
    • Slashdot T-Shirt Contest Winners!
    • iPhoto 2: The Missing Manual
    • frottle: Defeating the Wireless Hidden Node Problem
    • Interviews: Find Out About the Future of Science
    • 4Gb CF Card Announced
    • The Effect of Pirated CDs
    • Novell Buys Ximian
    • MSI's Home Theatre PC Reviewed
    • How's Your Cell Service?
    • New High-End HP Calculator?
    • AMD, Transmeta Edge Up In Market Share
    • X-Prize Overview: To The Edge Of Space, Cheap


    The HP Calc one is the only speculative article. There is very little definite fact compared to the other articles.

    Oh, and 10 TB is nothing. I've 1 TB of storage in my house. I read somewhere that Google has one person assigned to manage every 100TB of data.

    Mike

    Leave a comment:


  • stanmxl
    replied
    Originally posted by Wayne Luke
    Maybe you would. I could not care less about a company's ticker unless I own stock of theirs.
    I'd care. Because I'd know it really is publicly traded and I just don't make up facts without verifying.

    Leave a comment:


  • Wayne Luke
    replied
    Originally posted by stanmxl
    There's been talk about them having an IPO for a very, very, very long time, but it's never gone through.

    Else we'd all know Google's ticker by now, right?
    Maybe you would. I could not care less about a company's ticker unless I own stock of theirs.

    Leave a comment:


  • stanmxl
    replied
    Originally posted by Wayne Luke
    Last I heard they had an IPO a few months ago... IPO means Initial Public Offering. Unless that fell through because of SEC violations on Google's part in the filing they would be publicly traded.

    If it didn't go through then the owners can stay in control as long as they don't need the capital that the IPO would bring. More power to them.
    There's been talk about them having an IPO for a very, very, very long time, but it's never gone through.

    Else we'd all know Google's ticker by now, right?

    Leave a comment:


  • tgillespie
    replied
    Originally posted by stanmxl
    I think you don't either.

    Google is not a publicly traded company. That is its current market status.

    Do you still think he knows what he's talking about?
    Yes.

    Leave a comment:


  • Wayne Luke
    replied
    Originally posted by stanmxl
    I think you don't either.

    Google is not a publicly traded company. That is its current market status.

    Do you still think he knows what he's talking about?
    Last I heard they had an IPO a few months ago... IPO means Initial Public Offering. Unless that fell through because of SEC violations on Google's part in the filing they would be publicly traded.

    If it didn't go through then the owners can stay in control as long as they don't need the capital that the IPO would bring. More power to them.

    Leave a comment:


  • Joe
    replied
    Kackle, thanks for that post, it clears things up quite a bit. I can't wait to see how Google overcomes this one...

    Leave a comment:


  • stanmxl
    replied
    Originally posted by tgillespie
    I think he does.

    He was probably referring to its market status
    I think you don't either.

    Google is not a publicly traded company. That is its current market status.

    Do you still think he knows what he's talking about?
    Last edited by stanmxl; Mon 4th Aug '03, 11:33am.

    Leave a comment:


  • Kackle
    replied
    Beorn says:

    > To tell you the truth, I'm not amazed that page was rejected by /.

    I wrote that page.

    Best estimates are that on the average, each docID is used twice per word per page. That's because they have two inverted indexes. One is "fancy" and the other is "plain."

    The average number of words per web page is 300. Here are the space requirements for the docID if we assume 4 bytes, 12 bytes, and 20 bytes, for 4 billion web pages:

    4 bytes: 300 * 4 billion * 8 = 9.6 x 10^12 bytes (about 10 terabytes)

    12 bytes: 300 * 4 billion * 24 = 2.88 x 10^13 bytes (about 29 terabytes)

    20 bytes: 300 * 4 billion * 40 = 4.8 x 10^13 bytes (48 terabytes)
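    The same arithmetic as a short script, using only the figures given above (300 words per page, 4 billion pages, each docID used twice per word). The last factor in each line above is the docID size doubled for that reason. Terabytes here are decimal (10^12 bytes), and the names are just for illustration.

```python
WORDS_PER_PAGE = 300
PAGES = 4_000_000_000
USES_PER_WORD = 2                 # one entry in each of the two inverted indexes


def docid_storage_bytes(docid_bytes):
    """Total bytes spent on docIDs across both inverted indexes."""
    return WORDS_PER_PAGE * PAGES * USES_PER_WORD * docid_bytes


if __name__ == "__main__":
    for size in (4, 8, 12, 20):
        terabytes = docid_storage_bytes(size) / 1e12
        print(f"{size:2d}-byte docID: {terabytes:.1f} TB")
```

    Running it for 8 bytes as well gives 19.2 TB, roughly 10 TB more than the 4-byte case.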

    If you were designing a search engine, how many bytes would you choose for your docID?

    No, you wouldn't use 8 bytes. That's wasting 10 terabytes because you can use 4 bytes instead, for 4 billion web pages. A waste of 10 terabytes means your processing slows down, in addition to the fact that you have to spend all that money on RAM or on hard disk space to store your inverted indexes.

    Instead of 8 bytes, you'd use 4 bytes. No question about it. Then if you want to display or preserve the ID somewhere, like in a URL, you would use the 4 bytes to retrieve the long version of the ID, using a conversion algo. You can insert this alphanumeric long version into the cache copy URL and use it to retrieve the 4 bytes later. This conversion process is cheap, because it doesn't have to be done all that often. Certainly not as often as the intensive amount of processing required on the inverted indexes, which are immediately consulted with every new query coming into the front end of Google.
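    One way such a conversion could look, purely as a sketch: a two-way lookup table between the compact docID and a longer display ID. The 12-character random format and the class name are made up for illustration; the actual conversion Google uses is not known.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits
LONG_ID_LENGTH = 12               # hypothetical length of the display ID


class LongIdTable:
    """Maps compact integer docIDs to long alphanumeric IDs and back."""

    def __init__(self):
        self._long_by_docid = {}
        self._docid_by_long = {}

    def long_id_for(self, docid):
        """Return the long ID for a docID, creating one on first use."""
        if docid not in self._long_by_docid:
            long_id = self._new_long_id()
            self._long_by_docid[docid] = long_id
            self._docid_by_long[long_id] = docid
        return self._long_by_docid[docid]

    def docid_for(self, long_id):
        """Recover the compact docID from a long ID taken from a URL."""
        return self._docid_by_long[long_id]

    def _new_long_id(self):
        # Retry on the (very unlikely) chance of a collision.
        while True:
            candidate = "".join(
                secrets.choice(ALPHABET) for _ in range(LONG_ID_LENGTH)
            )
            if candidate not in self._docid_by_long:
                return candidate


if __name__ == "__main__":
    table = LongIdTable()
    long_id = table.long_id_for(42)
    print(long_id, "->", table.docid_for(long_id))
```

    The lookup only runs when a URL is built or followed, so its cost stays off the hot path through the inverted indexes.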

    Take a look at this paper:

    "The Term Vector Database: Fast Access to Indexing Terms for Web Pages" by Raymie Stata, Krishna Bharat, Farzin Maghoul, which can be found at http://www9.org/w9cdrom/159/159.html

    Bharat is a member of the research staff at Google and has a Ph.D. in computer science. Here are some quotes from that paper:

    "Rather than deal directly with URLs, the Connectivity Server uses a set of densely-packed integers to identify pages."

    "Recall that page identifiers are a dense set of integers."

    "To avoid wasting space, we pack vector records densely."

    "Functions in the Connectivity Server convert between these integers and text URLs. In our work with the Connectivity Server, these identifiers have proven more convenient to handle in code than text URLs."

    "Notice the use of integers to represent terms; as with page IDs in the Connectivity Server, we find these to be more convenient to manipulate than text strings."

    "The page ID of the vector is stored in the first 4-bytes of the vector's record."
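    A small sketch of what those quotes describe: URLs mapped to a dense run of integers, and a record that starts with the page ID in its first 4 bytes. Only the 4-byte page ID at the front comes from the paper; the rest of the record layout (4-byte term IDs, big-endian) and all the names are assumptions for illustration.

```python
import struct


class PageIdAssigner:
    """Hands out dense integer page IDs (0, 1, 2, ...) and converts back to URLs."""

    def __init__(self):
        self._id_by_url = {}
        self._url_by_id = []

    def id_for(self, url):
        if url not in self._id_by_url:
            self._id_by_url[url] = len(self._url_by_id)
            self._url_by_id.append(url)
        return self._id_by_url[url]

    def url_for(self, page_id):
        return self._url_by_id[page_id]


def pack_record(page_id, term_ids):
    """Page ID in the first 4 bytes, then each term ID as 4 bytes (assumed layout)."""
    return struct.pack(f">I{len(term_ids)}I", page_id, *term_ids)


def unpack_page_id(record):
    """Read the page ID back out of the first 4 bytes of a record."""
    return struct.unpack_from(">I", record, 0)[0]


if __name__ == "__main__":
    assigner = PageIdAssigner()
    page_id = assigner.id_for("http://example.com/")
    record = pack_record(page_id, [7, 19, 204])
    print(len(record), "bytes;", assigner.url_for(unpack_page_id(record)))
```

    A 4-byte unsigned field tops out at 4,294,967,295, which is where the ceiling being discussed in this thread comes from.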

    It's clear that four-byte docIDs were used at one time. It's also clear that increasing beyond four bytes is not trivial. Finally, there is no way they would use 8 bytes if they can use 5 bytes instead. Remember, this docID is used twice for every indexed word on every web page on the web. You want it as efficient as possible.

    Leave a comment:


  • tgillespie
    replied
    Originally posted by stanmxl
    Do you have the slightest idea about what you're saying?

    Obviously not.
    I think he does.

    He was probably referring to its market status

    Leave a comment:


  • stanmxl
    replied
    Originally posted by Wayne Luke
    Google is publicly traded
    Do you have the slightest idea about what you're saying?

    Obviously not.

    Leave a comment:
